Fine-Grained Features for Image Captioning
Author Affiliations: Zhejiang Sci-Tech University, Hangzhou 310020, China; Zhejiang University of Technology, Hangzhou 310020, China
Publication: Computers, Materials & Continua
Year/Volume/Issue: 2023, Vol. 75, No. 6
Pages: 4697-4712
Subject Classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees conferrable in Engineering or Science)]
Funding: Supported in part by the National Natural Science Foundation of China (NSFC) under Grant 6150140, and in part by the Youth Innovation Project (21032158-Y) of Zhejiang Sci-Tech University
Keywords: image captioning; region features; fine-grained features; fusion
Abstract: Image captioning involves two different major modalities (image and sentence) and converts a given image into language that adheres to visual semantics. Almost all methods first extract image features to reduce the difficulty of visual-semantic embedding and then use a caption model to generate fluent sentences. A Convolutional Neural Network (CNN) is often used to extract image features in image captioning, and the use of object detection networks to extract region features has achieved great success. However, the region features retrieved by this method are object-level and do not attend to fine-grained details because of the detection model's limitations. We offer an approach that addresses this issue and generates more accurate captions by fusing fine-grained features with region features. First, we extract fine-grained features using a panoptic segmentation network. Then, we propose two fusion methods and compare their fusion results. The X-Linear Attention Network (X-LAN) serves as the foundation for both fusion methods. According to experimental findings on the COCO dataset, the two-branch fusion approach is better. It is important to note that on the COCO Karpathy test split, CIDEr is increased up to 134.3% in comparison to the baseline, highlighting the effectiveness and viability of our method.
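The record gives no implementation details beyond the abstract. Purely as an illustrative sketch, and not the paper's actual architecture, the snippet below shows one generic way a two-branch fusion of detector region features and segmentation-derived fine-grained features could be wired up in PyTorch. The module name, feature dimensions, and the choice of concatenating along the token axis are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Hypothetical sketch: region features (e.g., from an object
    detector) and fine-grained features (e.g., from a panoptic
    segmentation network) are projected into a shared space and
    concatenated for a downstream caption decoder."""

    def __init__(self, region_dim=2048, fine_dim=256, hidden_dim=512):
        super().__init__()
        # Independent projection branch for each feature source
        self.region_proj = nn.Sequential(nn.Linear(region_dim, hidden_dim), nn.ReLU())
        self.fine_proj = nn.Sequential(nn.Linear(fine_dim, hidden_dim), nn.ReLU())

    def forward(self, region_feats, fine_feats):
        # region_feats: (batch, n_regions, region_dim)
        # fine_feats:   (batch, n_segments, fine_dim)
        r = self.region_proj(region_feats)
        f = self.fine_proj(fine_feats)
        # Concatenate along the token axis so the caption decoder
        # can attend over both feature sets jointly
        return torch.cat([r, f], dim=1)

# Example usage with dummy tensors
fusion = TwoBranchFusion()
regions = torch.randn(2, 36, 2048)   # 36 detected regions per image
segments = torch.randn(2, 50, 256)   # 50 segmentation-derived features
fused = fusion(regions, segments)
print(fused.shape)  # torch.Size([2, 86, 512])
```

In a layout like this, the caption decoder (here, the paper builds on X-LAN's attention stack) would attend jointly over the concatenated sequence; the paper's two actual fusion designs and their comparison are described in the article itself.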