Multimodal Pretraining from Monolingual to Multilingual
Author affiliation: School of Information, Renmin University of China, Beijing 100872, China
Publication: Machine Intelligence Research (机器智能研究(英文版))
Year/Volume/Issue: 2023, Vol. 20, No. 2
Pages: 220-232
Subject classification: 12 [Management]; 1201 [Management - Management Science and Engineering (degrees in management or engineering)]; 081104 [Engineering - Pattern Recognition and Intelligent Systems]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0811 [Engineering - Control Science and Engineering]; 0812 [Engineering - Computer Science and Technology (degrees in engineering or science)]
Funding: Supported by the National Natural Science Foundation of China (No. 62072462), the National Key R&D Program of China (No. 2020AAA0108600), and the Large-scale Pretraining Program 468 of Beijing Academy of Artificial Intelligence (BAAI)
Keywords: Multilingual pretraining, multimodal pretraining, cross-lingual transfer, multilingual generation, cross-modal retrieval
Abstract: Multimodal pretraining has made convincing achievements in various downstream tasks in recent years. However, since the majority of the existing works construct models based on English, their applications are limited by language. In this work, we address this issue by developing models with multimodal and multilingual capabilities. We explore two types of methods to extend multimodal pretraining models from monolingual to multilingual. Specifically, we propose a pretraining-based model named multilingual multimodal pretraining (MLMM), and two generalization-based models named multilingual CLIP (M-CLIP) and multilingual acquisition (MLA). In addition, we further extend the generalization-based models to incorporate the audio modality and develop the multilingual CLIP for vision, language, and audio (CLIP4VLA). Our models achieve state-of-the-art performance on multilingual vision-text retrieval, visual question answering, and image captioning benchmarks. Based on the experimental results, we discuss the pros and cons of the two types of models and their potential practical applications.
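To make the generalization-based idea in the abstract concrete, below is a minimal, hypothetical sketch of CLIP-style multilingual image-text alignment: a frozen image encoder paired with a trainable multilingual text encoder, optimized with the symmetric contrastive (InfoNCE) objective used by CLIP. This is not the paper's implementation; all class names, dimensions, and the toy data are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class ToyImageEncoder(torch.nn.Module):
    """Stand-in for a pretrained CLIP vision tower (hypothetical)."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 32 * 32, dim)

    def forward(self, images):  # images: (B, 3, 32, 32)
        return self.proj(images.flatten(1))

class ToyMultilingualTextEncoder(torch.nn.Module):
    """Stand-in for a pretrained multilingual transformer (hypothetical)."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = torch.nn.EmbeddingBag(vocab, dim)  # mean-pools token embeddings

    def forward(self, token_ids):  # token_ids: (B, L)
        return self.emb(token_ids)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized embeddings, as in CLIP."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (B, B) image-text similarity matrix
    targets = torch.arange(img.size(0))   # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 4 images paired with captions; token ids are dummy placeholders
# for tokenized captions that could be in any language.
images = torch.randn(4, 3, 32, 32)
captions = torch.randint(0, 1000, (4, 12))

image_encoder = ToyImageEncoder()
text_encoder = ToyMultilingualTextEncoder()

# Freeze the vision side; only the multilingual text encoder is adapted,
# which is the essence of the generalization-based approach.
for p in image_encoder.parameters():
    p.requires_grad = False

loss = contrastive_loss(image_encoder(images), text_encoder(captions))
loss.backward()
print(f"contrastive loss: {loss.item():.4f}")
```

In a realistic setting, the toy modules above would be replaced by a pretrained CLIP image encoder and a pretrained multilingual text encoder, so that captions in many languages map into the shared image-text embedding space used for retrieval.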