
Efficient Image Captioning Based on Vision Transformer Models

Authors: Samar Elbedwehy, T. Medhat, Taher Hamza, Mohammed F. Alrahmawy

Affiliations: Department of Data Science, Faculty of Artificial Intelligence, Kafrelsheikh University, Egypt; Department of Electrical Engineering, Faculty of Engineering, Kafrelsheikh University, Egypt; Department of Computer Science, Faculty of Computer and Information Science, Mansoura, Egypt

Published in: Computers, Materials & Continua

Year/Volume/Issue: 2022, Vol. 73, No. 10

Pages: 1483-1500


Subject classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees conferrable in Engineering or Science)]

Keywords: image captioning; sequence-to-sequence; self-distillation; transformer; convolutional layer

Abstract: Image captioning is an emerging field in machine learning. It refers to the ability to automatically generate a syntactically and semantically meaningful sentence that describes the content of an image. Image captioning requires a complex machine learning process, as it involves two sub-models: a vision sub-model for extracting object features, and a language sub-model that uses the extracted features to generate meaningful sentences. Attention-based vision transformer models have recently had a great impact in the vision field. In this paper, we studied the effect of using vision transformers on the image captioning process by evaluating four different vision transformer models as the vision sub-model of the image captioning pipeline. The first vision transformer used is DINO (self-distillation with no labels). The second is PVT (Pyramid Vision Transformer), a vision transformer that does not use convolutional layers. The third is XCIT (Cross-Covariance Image Transformer), which changes the self-attention operation by focusing on the feature dimension instead of the token dimension. The last one is SWIN (Shifted Windows), a vision transformer which, unlike the other transformers, uses shifted windows when splitting the image. For a deeper evaluation, the four mentioned vision transformers have been tested in their different versions and configurations: we evaluate the DINO model with five different backbones, PVT in two versions (PVT_v1 and PVT_v2), one model of XCIT, and the SWIN transformer. The results show the high effectiveness of using the SWIN transformer within the proposed image captioning model relative to the other models.
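The two-sub-model structure described in the abstract (a vision sub-model extracting patch features, and a language sub-model decoding them into a caption) can be sketched roughly as follows. This is a minimal illustration, not the paper's code: `PatchEncoder` is a hypothetical stand-in for any of the evaluated vision transformers (DINO, PVT, XCIT, SWIN), which in practice would be pretrained models supplying the features.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Vision sub-model (stand-in): split the image into patches and embed them.
    A real system would use a pretrained vision transformer here instead."""
    def __init__(self, patch=16, dim=256):
        super().__init__()
        # One strided convolution is equivalent to patchify + linear projection.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):                   # images: (B, 3, H, W)
        feats = self.proj(images)                # (B, dim, H/patch, W/patch)
        return feats.flatten(2).transpose(1, 2)  # (B, n_patches, dim)

class CaptionDecoder(nn.Module):
    """Language sub-model: generate caption tokens conditioned on patch features."""
    def __init__(self, vocab=1000, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, memory):           # tokens: (B, T)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.out(h)                       # (B, T, vocab)

images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 1000, (2, 12))
memory = PatchEncoder()(images)                  # 224/16 = 14 -> 196 patches
logits = CaptionDecoder()(tokens, memory)
print(memory.shape, logits.shape)  # (2, 196, 256) and (2, 12, 1000)
```

Swapping the vision sub-model, as the paper does across DINO, PVT, XCIT, and SWIN, amounts to replacing `PatchEncoder` while keeping the decoder interface (a sequence of patch features) fixed.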
