Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval
Author Affiliations: School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China; School of Computer Science, Shaanxi Normal University, Xi'an 710119, China; School of Computer Engineering, Weifang University, Weifang 261061, China; Guangxi Colleges and Universities Key Laboratory of Intelligent Industry Software, Wuzhou University, Wuzhou 543002, China
Publication: Journal of Computer Science & Technology (计算机科学技术学报(英文版))
Year/Volume/Issue: 2024, Vol. 39, No. 4
Pages: 811-826
Subject Classification: 1205 [Management - Library, Information and Archival Science]; 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees awardable in Engineering or Science)]
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62076048
Keywords: image-text retrieval; cross-modal retrieval; multi-task learning; graph convolutional network
Abstract: Image-text retrieval aims to capture the semantic correspondence between images and texts, which serves as a foundation and crucial component in multi-modal recommendations, search systems, and other online applications. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the advantageous impact of multi-task learning on image-text retrieval. To this end, a multi-task visual semantic embedding network (MVSEN) is proposed for image-text retrieval. First, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints to improve the generalization and robustness of visual semantic embedding from a training perspective. Moreover, we present an intra- and inter-modality interaction scheme to learn discriminative visual and textual feature representations by facilitating information flow within and between modalities. Finally, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MSCOCO, with rSum improvements of 8.2% and 3.0%, respectively.
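To make the multi-task formulation described in the abstract concrete, the sketch below combines a main image-text ranking objective with the two auxiliary objectives the abstract names (text-text matching and multi-label classification). This is a minimal illustration under stated assumptions: the hinge-based ranking loss, the shared linear classification head, the margin, and the weights w_tt and w_cls are choices made here for the example and are not details taken from the paper.

```python
# Minimal PyTorch sketch of a multi-task objective in the spirit of MVSEN.
# The abstract only names the tasks (image-text matching, text-text matching,
# multi-label classification); the concrete loss forms, margin, classifier head,
# and task weights below are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F


def triplet_ranking_loss(emb_a, emb_b, margin=0.2):
    """Hinge-based bidirectional ranking loss over in-batch negatives (assumed form)."""
    scores = emb_a @ emb_b.t()                      # similarity matrix; rows of emb_a vs. rows of emb_b
    diag = scores.diag().view(-1, 1)                # matched-pair scores
    cost_b = (margin + scores - diag).clamp(min=0)      # a -> negative b
    cost_a = (margin + scores - diag.t()).clamp(min=0)  # b -> negative a
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_b.masked_fill(mask, 0).sum() + cost_a.masked_fill(mask, 0).sum()


class MultiTaskLoss(nn.Module):
    """Main image-text matching loss plus the two auxiliary objectives."""

    def __init__(self, emb_dim=1024, num_labels=80, w_tt=1.0, w_cls=1.0):
        super().__init__()
        self.classifier = nn.Linear(emb_dim, num_labels)  # multi-label head (assumption)
        self.w_tt, self.w_cls = w_tt, w_cls

    def forward(self, img_emb, txt_emb, txt_emb_aux, labels):
        # img_emb, txt_emb, txt_emb_aux: (batch, emb_dim); labels: (batch, num_labels) multi-hot floats
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        txt_emb_aux = F.normalize(txt_emb_aux, dim=-1)

        loss_it = triplet_ranking_loss(img_emb, txt_emb)        # main image-text matching
        loss_tt = triplet_ranking_loss(txt_emb, txt_emb_aux)    # auxiliary text-text matching
        logits = self.classifier(txt_emb)                       # auxiliary multi-label classification
        loss_cls = F.binary_cross_entropy_with_logits(logits, labels)
        return loss_it + self.w_tt * loss_tt + self.w_cls * loss_cls
```

In this sketch, txt_emb_aux stands for a second textual view of the same sample (for example, another caption of the same image), and the auxiliary terms act as the training-time semantic constraints the abstract attributes to the two auxiliary tasks; how MVSEN actually defines and weights these losses is specified in the full paper.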