Visuals to Text: A Comprehensive Review on Automatic Image Captioning
Author affiliations: Beijing University of Posts and Telecommunications, Beijing 100876, China; School of Creative Technologies, University of Portsmouth, Portsmouth PO1 2DJ, UK
Publication: IEEE/CAA Journal of Automatica Sinica (Acta Automatica Sinica, English Edition)
Year, volume, issue: 2022, Vol. 9, No. 8
Pages: 1339-1365
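Subject classification: 12 [Management]; 1201 [Management: Management Science and Engineering (degrees in Management or Engineering may be conferred)]; 081104 [Engineering: Pattern Recognition and Intelligent Systems]; 08 [Engineering]; 080203 [Engineering: Mechanical Design and Theory]; 0835 [Engineering: Software Engineering]; 0802 [Engineering: Mechanical Engineering]; 0811 [Engineering: Control Science and Engineering]; 0812 [Engineering: Computer Science and Technology (degrees in Engineering or Science may be conferred)]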
Funding: Supported by the Beijing Natural Science Foundation of China (L201023) and the Natural Science Foundation of China (62076030)
Topics: Artificial intelligence; attention mechanism; encoder-decoder framework; image captioning; multi-modal understanding; training strategies
Abstract: Image captioning refers to the automatic generation of descriptive text according to the visual content of images. It is a technique integrating multiple disciplines, including computer vision (CV), natural language processing (NLP) and artificial intelligence. In recent years, substantial research efforts have been devoted to generating image captions, with impressive progress. To summarize the recent advances in image captioning, we present a comprehensive review of the field, covering both traditional methods and recent deep learning-based techniques. Specifically, we first briefly review the early traditional works based on retrieval and templates. We then focus on deep learning-based image captioning research, which is categorized into the encoder-decoder framework, attention mechanisms and training strategies on the basis of model structures and training manners, for a detailed introduction. After that, we summarize the publicly available datasets, the evaluation metrics and those proposed for specific requirements, and then compare the state-of-the-art methods on the MS COCO dataset. Finally, we provide some discussion of open challenges and future research directions.