Search results - Nantong Library

A survey of multi-modal learning theory

Journal of Sun Yat-sen University: Natural Science Edition (Chinese and English), 2023, Vol. 62, No. 5, pp. 38-49

Authors: HUANG Yu, HUANG Longbo. Affiliation: Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China

Deep multi-modal learning, a rapidly growing field with a wide range of practical applications, aims to effectively utilize and integrate information from multiple sources, known as modalities. Despite its impressive empirical performance, the theoretical foundations of deep multi-modal learning have yet to be fully explored. In this paper, we will undertake a comprehensive survey of recent developments in multi-modal learning theories, focusing on the fundamental properties that govern this field. Our goal is to provide a thorough collection of current theoretical tools for analyzing multi-modal learning, to clarify their implications for practitioners, and to suggest future directions for the establishment of a solid theoretical foundation for deep multi-modal learning.

Keywords: multi-modal learning; machine learning theory; optimization; generalization

Transformers in computational visual media: A survey

Computational Visual Media, 2022, Vol. 8, No. 1, pp. 33-62

Authors: Yifan Xu, Huapeng Wei, Minxuan Lin, Yingying Deng, Kekai Sheng, Mengdan Zhang, Fan Tang, Weiming Dong, Feiyue Huang, Changsheng Xu. Affiliations: NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100040, China; School of Artificial Intelligence, Jilin University, Changchun 130012, China; Youtu Lab, Tencent Inc., Shanghai 200233, China; CASIA-LLVISION Joint Lab, Beijing 100190, China

Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.

Keywords: visual transformer; computational visual media (CVM); high-level vision; low-level vision; image generation; multi-modal learning

Visual Superordinate Abstraction for Robust Concept Learning

Machine Intelligence Research, 2023, Vol. 20, No. 1, pp. 79-91

Authors: Qi Zheng, Chao-Yue Wang, Dadong Wang, Da-Cheng Tao. Affiliations: University of Sydney, Sydney 2008, Australia; JD Explore Academy, Beijing 100176, China; DATA61, Commonwealth Scientific and Industrial Research Organisation, Sydney 2122, Australia

Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners are still vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g., {red, blue, ···} ∈ "color" subspace yet cube ∈ "shape". In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e., visual superordinates). With only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, quasi-center visual concept clustering and superordinate shortcut learning schemes are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings: it yields relative improvements in overall answering accuracy of 7.5% for reasoning with perturbations and 15.6% for compositional generalization tests.

Keywords: Concept learning; visual question answering; weakly-supervised learning; multi-modal learning; curriculum learning

Transformers in medical image analysis

Intelligent Medicine, 2023, Vol. 3, No. 1, pp. 59-78

Authors: Kelei He, Chen Gan, Zhuoyuan Li, Islem Rekik, Zihao Yin, Wen Ji, Yang Gao, Qian Wang, Junfeng Zhang, Dinggang Shen. Affiliations: Medical School of Nanjing University, Nanjing, Jiangsu 210093, China; National Institute of Healthcare Data Science at Nanjing University, Nanjing, Jiangsu 210093, China; BASIRA Laboratory, Faculty of Computer and Informatics Engineering, Istanbul Technical University, Istanbul, Turkey; School of Science and Engineering, Computing, University of Dundee, UK; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu 210093, China; School of Biomedical Engineering, ShanghaiTech University, Shanghai 201210, China; Department of Research and Development, Shanghai United Imaging Intelligence Co., Ltd., Shanghai 200030, China; Shanghai Clinical Research and Trial Center, Shanghai 201703, China

Transformers have dominated the field of natural language processing and have recently made an impact in the area of computer vision. In the field of medical image analysis, transformers have also been successfully applied to full-stack clinical applications, including image synthesis/reconstruction, registration, segmentation, detection, and diagnosis. This paper aims to promote awareness of the applications of transformers in medical image analysis. Specifically, we first provide an overview of the core concepts of the attention mechanism built into transformers and other basic components. Second, we review various transformer architectures tailored for medical image applications and discuss their limitations. Within this review, we investigate key challenges including the use of transformers in different learning paradigms, improving model efficiency, and coupling with other techniques. We hope this review will provide a comprehensive picture of transformers to readers with an interest in medical image analysis.

Keywords: Transformer; Medical image analysis; Deep learning; Diagnosis; Registration; Segmentation; Image synthesis; multi-task learning; multi-modal learning; Weakly-supervised learning

Faster Zero-shot Multi-modal Entity Linking via Visual-Linguistic Representation

Data Intelligence, 2022, Vol. 4, No. 3, pp. 493-508

Authors: Qiushuo Zheng, Hao Wen, Meng Wang, Guilin Qi, Chaoyu Bai. Affiliations: School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China; School of Computer Science and Engineering, Southeast University, Nanjing 211189, China; Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing 211189, China

Multi-modal entity linking plays a crucial role in a wide range of knowledge-based modal-fusion tasks, e.g., multi-modal retrieval and multi-modal event extraction. We introduce the new ZEro-shot Multi-modal Entity Linking (ZEMEL) task. Its format is similar to multi-modal entity linking, but multi-modal mentions are linked to unseen entities in the knowledge graph, and the purpose of the zero-shot setting is to realize robust linking in highly specialized domains. At the same time, the inference efficiency of existing models is low when there are many candidate entities. To address this, we propose a novel model that leverages visual-linguistic representation through a co-attentional mechanism to deal with the ZEMEL task, considering the trade-off between the performance and efficiency of the model. We also build a dataset named ZEMELD for the new task, which contains multi-modal data resources collected from Wikipedia, and we annotate the entities as ground truth. Extensive experimental results on the dataset show that our proposed model is effective, as it significantly improves the precision from 68.93% to 82.62% compared with baselines in the ZEMEL task.

Keywords: Knowledge Graph; multi-modal learning; Poly Encoders