Efficient Image Captioning Based on Vision Transformer Models
Author Affiliations: Department of Data Science, Faculty of Artificial Intelligence, Kafrelsheikh University, Egypt; Department of Electrical Engineering, Faculty of Engineering, Kafrelsheikh University, Egypt; Department of Computer Science, Faculty of Computer and Information Science, Mansoura, Egypt
Publication: Computers, Materials & Continua
Year/Volume/Issue: 2022, Vol. 73, No. 10
Pages: 1483-1500
Subject Classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees conferrable in Engineering or Science)]
Keywords: image captioning; sequence-to-sequence; self-distillation; transformer; convolutional layer
Abstract: Image captioning is an emerging field in machine learning. It refers to the ability to automatically generate a syntactically and semantically meaningful sentence that describes the content of an image. Image captioning requires a complex machine learning process, as it involves two sub-models: a vision sub-model for extracting object features and a language sub-model that uses the extracted features to generate a meaningful sentence. Attention-based vision transformer models have recently had a great impact on the vision field. In this paper, we study the effect of using vision transformers in the image captioning process by evaluating four different vision transformer models as the vision sub-model of the image captioning system. The first vision transformer used is DINO (self-distillation with no labels). The second is PVT (Pyramid Vision Transformer), a vision transformer that does not use convolutional layers. The third is XCiT (Cross-Covariance Image Transformer), which changes the self-attention operation by attending over the feature dimension instead of the token dimension. The last one is SWIN (Shifted Windows), a vision transformer that, unlike the other transformers, splits the image using shifted windows. For a deeper evaluation, the four vision transformers were tested in their different versions and configurations: DINO with five different backbones, PVT in two versions (PVT_v1 and PVT_v2), one XCiT model, and the SWIN transformer. The results show the high effectiveness of the SWIN transformer within the proposed image captioning model compared with the other models.
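As a rough illustration of the two-sub-model design described in the abstract, the following sketch (not the authors' code) pairs a SWIN vision encoder from the timm library with a small transformer language decoder in PyTorch. The backbone name, feature dimensions, vocabulary size, and decoder depth are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a vision-transformer image captioner:
# a pretrained SWIN backbone (vision sub-model) produces image
# features that condition a transformer decoder (language sub-model).
import torch
import torch.nn as nn
import timm


class CaptioningModel(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, max_len=30):
        super().__init__()
        # Vision sub-model: SWIN backbone with the classifier head
        # removed (num_classes=0), so it returns pooled features.
        self.encoder = timm.create_model(
            "swin_base_patch4_window7_224", pretrained=True, num_classes=0)
        self.proj = nn.Linear(self.encoder.num_features, d_model)
        # Language sub-model: token embeddings, learned positions,
        # and a small transformer decoder over the image features.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # images: (B, 3, 224, 224); tokens: (B, T) caption prefixes.
        memory = self.proj(self.encoder(images)).unsqueeze(1)  # (B, 1, d)
        tgt = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        # Causal mask so each position attends only to earlier tokens.
        t = tokens.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask.to(tokens.device))
        return self.head(out)  # (B, T, vocab_size) next-token logits


if __name__ == "__main__":
    model = CaptioningModel()
    imgs = torch.randn(2, 3, 224, 224)
    toks = torch.randint(0, 10000, (2, 12))
    print(model(imgs, toks).shape)  # torch.Size([2, 12, 10000])
```

Under the same assumptions, the other backbones compared in the paper could be swapped in by changing only the timm model name (for example a PVT_v2 or XCiT variant), since the projection layer reads the pooled feature width from `num_features` rather than hard-coding it.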