Search Results - Nantong Library

Learning a Mixture of Conditional Gating Blocks for Visual Question Answering

Journal of Computer Science & Technology, 2024, Vol. 39, No. 4, pp. 912-928

Authors: Qiang Sun, Yan-Wei Fu, Xiang-Yang Xue. Affiliations: School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai 201620, China; Academy for Engineering and Technology, Fudan University, Shanghai 200433, China; School of Data Science, Fudan University, Shanghai 200433, China; School of Computer Science, Fudan University, Shanghai 200433, China

As a Turing test in multimedia, visual question answering (VQA) aims to answer the textual question with a given ***, the “dynamic” property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of the neural network ***, despite the prevalence of dynamic convolutional neural networks, it is relatively less touched and very nontrivial to exploit dynamics in the transformers of the VQA tasks through all the stages in an end-to-end ***, due to the large computation cost of transformers, researchers are inclined to only apply transformers on the extracted high-level visual features for downstream vision and language *** this end, we introduce a question-guided dynamic layer to the transformer as it can effectively increase the model capacity and require fewer transformer layers for the VQA *** particular, we name the dynamics in the Transformer as Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with conditional ResNeXt block (cResNeXt). Thus a novel model mixture of conditional gating blocks (McG) is proposed for VQA, which keeps the best of the Transformer, convolutional neural network (CNN), and dynamic *** pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special examples of *** quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract *** experiments show that McG has achieved the state-of-the-art performance on these benchmark datasets.

Keywords: visual question answering; Transformer; dynamic network

Source: VIP Journal Database

Prompting Large Language Models with Knowledge-Injection for Knowledge-Based Visual Question Answering

Big Data Mining and Analytics, 2024, Vol. 7, No. 3, pp. 843-857

Authors: Zhongjian Hu, Peng Yang, Fengyuan Liu, Yuan Meng, Xingyu Liu. Affiliations: School of Computer Science and Engineering, Southeast University, and the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education of the People’s Republic of China, Nanjing 211189, China; Southeast University-Monash University Joint Graduate School (Suzhou), Southeast University, Suzhou 215125, China

Previous works employ a Large Language Model (LLM) like GPT-3 for knowledge-based visual question answering (VQA). We argue that the inferential capacity of the LLM can be enhanced through knowledge *** methods that utilize knowledge graphs to enhance LLMs have been explored in various tasks, they may have some limitations, such as the possibility of not being able to retrieve the required *** this paper, we introduce a novel framework for knowledge-based VQA titled “Prompting Large Language Models with Knowledge-Injection” (PLLMKI). We use a vanilla VQA model to inspire the LLM and further enhance the LLM with knowledge *** earlier approaches, we adopt the LLM for knowledge enhancement instead of relying on knowledge ***, we leverage open LLMs, incurring no additional *** comparison to existing baselines, our approach exhibits an accuracy improvement of over 1.3 and 1.7 on two knowledge-based VQA datasets, namely OK-VQA and A-OKVQA, respectively.

Keywords: visual question answering; knowledge-based visual question answering; large language model; knowledge injection

Source: VIP Journal Database

Improved Blending Attention Mechanism in Visual Question Answering

Computer Systems Science & Engineering, 2023, Vol. 47, No. 10, pp. 1149-1161

Authors: Siyu Lu, Yueming Ding, Zhengtong Yin, Mingzhe Liu, Xuan Liu, Wenfeng Zheng, Lirong Yin. Affiliations: School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China; College of Resource and Environment Engineering, Guizhou University, Guiyang 550025, China; School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325000, China; School of Public Affairs and Administration, University of Electronic Science and Technology of China, Chengdu 611731, China; Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803, USA

Visual question answering (VQA) has attracted more and more attention in computer vision and natural language *** are committed to studying how to better integrate image features and text features to achieve better results in VQA *** of all features may cause information redundancy and heavy computational *** mechanism is a wise way to solve this ***, using a single attention mechanism may cause incomplete concern of *** paper improves the attention mechanism method and proposes a hybrid attention mechanism that combines the spatial attention mechanism method and the channel attention mechanism *** the case that the attention mechanism will cause the loss of the original features, a small portion of image features were added as *** the attention mechanism of text features, a self-attention mechanism was introduced, and the internal structural features of sentences were strengthened to improve the overall *** results show that the attention mechanism and feature compensation add 6.1% accuracy to the multimodal low-rank bilinear pooling network.

Keywords: visual question answering; spatial attention mechanism; channel attention mechanism; image feature processing; text feature extraction

Source: VIP Journal Database

A survey of deep learning-based visual question answering

Journal of Central South University, 2021, Vol. 28, No. 3, pp. 728-746

Authors: HUANG Tong-yuan, YANG Yu-ling, YANG Xue-jiao. Affiliations: School of Artificial Intelligence, Chongqing University of Technology, Chongqing 401135, China; School of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China

With the rise and continuous development of machine learning, especially deep learning, research in the visual question answering field has made significant progress, with important theoretical research significance and practical application ***, it is necessary to summarize the current research and provide some reference for researchers in this *** article conducted a detailed and in-depth analysis and summary of relevant research and typical methods of visual question answering ***, relevant background knowledge about VQA (visual question answering) was ***, the issues and challenges of visual question answering were discussed, and at the same time, some promising discussion on the particular methodologies was ***, the key sub-problems affecting visual question answering were summarized and ***, the current commonly used data sets and evaluation indicators were ***, in view of the popular algorithms and models in VQA research, comparison of the algorithms and models was summarized and ***, the future development trend and conclusion of visual question answering were prospected.

Keywords: computer vision; natural language processing; visual question answering; deep learning; attention mechanism

Source: VIP Journal Database; Tongfang Journal Database

Contrastive Visual-Question-Caption Counterfactuals on Biased Samples for Visual Question Answering

43rd Chinese Control Conference

Authors: Xiaoqian Ju, Boyue Wang, Xiaoyan Li. Affiliation: Beijing University of Technology

The issue of language priors persists in existing visual question answering (VQA) models, hindering their ability to generalize across diverse QA distributions. Traditional strategies for counterfactual sample synthesis, which aim to eliminate language bias by generating counterfactuals for all training samples, encounter two primary challenges: (1) Not every sample contributes to language bias; thus, indiscriminate counterfactual synthesis may introduce new biases and adversely affect the model learning process. (2) The counterfactuals of questions often lose significant information, failing to effectively heighten the model's sensitivity to key terms. In this paper, we introduce the Contrastive Visual-Question-Caption Counterfactuals model for Biased Samples in VQA tasks. This model integrates captions to augment visual information within the textual domain and constructs counterfactual samples exclusively for biased samples, thereby mitigating the negative impacts of language ***, we employ a biased sample selection module to identify samples with language biases within the training set, considering that unbiased samples do not exacerbate the model's reliance on language patterns. To enrich the visual content in the textual domain, we synthesize caption-based counterfactual samples. To further enhance the effectiveness of counterfactual samples in improving the model's sensitivity, we develop a counterfactual contrast learning module. This module is designed to discern the relationship between visual and textual components within the same sample. Experimental results demonstrate that our proposed model not only is compatible with various VQA backbones but also significantly improves performance on the out-of-distribution dataset VQA CP v2.

Keywords: visual question answering; language bias; counterfactual

Source: CNKI Conference Database

Visual Superordinate Abstraction for Robust Concept Learning

Machine Intelligence Research, 2023, Vol. 20, No. 1, pp. 79-91

Authors: Qi Zheng, Chao-Yue Wang, Dadong Wang, Da-Cheng Tao. Affiliations: University of Sydney, Sydney 2008, Australia; JD Explore Academy, Beijing 100176, China; DATA61, Commonwealth Scientific and Industrial Research Organisation, Sydney 2122, Australia

Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners are still vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g., {red, blue, ···} ∈ “color” subspace yet cube ∈ “shape”. In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e., visual superordinates). With only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings, which increases the overall answering accuracy relatively by 7.5% for reasoning with perturbations and 15.6% for compositional generalization tests.

Keywords: concept learning; visual question answering; weakly-supervised learning; multi-modal learning; curriculum learning

Source: VIP Journal Database; Tongfang Journal Database

Improving VQA via Dual-Level Feature Embedding Network

Intelligent Automation & Soft Computing, 2024, Vol. 39, No. 3, pp. 397-416

Authors: Yaru Song, Huahu Xu, Dikai Fang. Affiliation: School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China

Visual question answering (VQA) has sparked widespread interest as a crucial task in integrating vision and *** primarily uses attention mechanisms to effectively answer questions to associate relevant visual regions with input *** detection-based features extracted by the object detection network aim to acquire the visual attention distribution on a predetermined detection frame and provide object-level insights to answer questions about foreground objects more ***, it cannot answer the question about the background forms without detection boxes due to the lack of fine-grained details, which is the advantage of grid-based *** this paper, we propose a Dual-Level Feature Embedding (DLFE) network, which effectively integrates grid-based and detection-based image features in a unified architecture to realize the complementary advantages of both ***, in DLFE, firstly, a novel Dual-Level Self-Attention (DLSA) module is proposed to mine the intrinsic properties of the two features, where Positional Relation Attention (PRA) is designed to model the position ***, we propose a Feature Fusion Attention (FFA) to address the semantic noise caused by the fusion of two features and construct an alignment graph to enhance and align the grid and detection ***, we use co-attention to learn the interactive features of the image and question and answer questions more *** method has significantly improved compared to the baseline, increasing accuracy from 66.01% to 70.63% on the test-std dataset of VQA 1.0 and from 66.24% to 70.91% for the test-std dataset of VQA 2.0.

Keywords: visual question answering; multi-modal feature processing; attention mechanisms; cross-model fusion

Source: VIP Journal Database
