
Audio-Text Multimodal Speech Recognition via Dual-Tower Architecture for Mandarin Air Traffic Control Communications

Authors: Shuting Ge, Jin Ren, Yihua Shi, Yujun Zhang, Shunzhi Yang, Jinfeng Yang

Affiliations: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan 114051, China; Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong-Macao Greater Bay Area, Shenzhen Polytechnic University, Shenzhen 518055, China; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; Industrial Training Centre, Shenzhen Polytechnic University, Shenzhen 518055, China

Published in: Computers, Materials & Continua

Year/Volume/Issue: 2024, Vol. 78, No. 3

Pages: 3215-3245


Subject classification: 081203 [Engineering - Computer Application Technology]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology (Engineering or Science degrees)]

Funding: This research was funded by the Shenzhen Science and Technology Program (Grant No. RCBS20221008093121051), the General Higher Education Project of Guangdong Provincial Education Department (Grant No. 2020ZDZX3085), the China Postdoctoral Science Foundation (Grant No. 2021M703371), and the Post-Doctoral Foundation Project of Shenzhen Polytechnic (Grant No. 6021330002K).

Keywords: speech-text multimodal; automatic speech recognition; semantic alignment; air traffic control communications; dual-tower architecture

Abstract: In air traffic control communications (ATCC), misunderstandings between pilots and controllers could result in fatal aviation accidents. Fortunately, advanced automatic speech recognition technology has emerged as a promising means of preventing miscommunications and enhancing aviation safety. However, most existing speech recognition methods merely incorporate external language models on the decoder side, leading to insufficient semantic alignment between speech and text modalities during the encoding phase. Furthermore, it is challenging to model acoustic context dependencies over long distances because speech sequences are longer than text, especially for the extended ATCC data. To address these issues, we propose a speech-text multimodal dual-tower architecture for speech recognition. It employs cross-modal interactions to achieve close semantic alignment during the encoding stage and to strengthen its capabilities in modeling auditory long-distance context dependencies. In addition, a two-stage training strategy is elaborately devised to derive semantics-aware acoustic representations effectively. The first stage focuses on pre-training the speech-text multimodal encoding module to enhance inter-modal semantic alignment and aural long-distance context dependencies. The second stage fine-tunes the entire network to bridge the input modality variation gap between the training and inference phases and boost generalization performance. Extensive experiments demonstrate the effectiveness of the proposed speech-text multimodal speech recognition method on the ATCC and AISHELL-1 datasets. It reduces the character error rate to 6.54% and 8.73%, respectively, and exhibits substantial performance gains of 28.76% and 23.82% compared with the best baseline model. The case studies indicate that the obtained semantics-aware acoustic representations aid in accurately recognizing terms with similar pronunciations but distinctive semantics. The research provides a novel modeling paradigm
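
To make the dual-tower idea in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of a speech tower and a text tower joined by cross-attention, where speech queries attend over text representations to produce semantics-aware acoustic features. All module names, dimensions, and the specific cross-attention wiring are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch (assumptions only): dual-tower speech-text encoder
# with one cross-modal attention layer, in PyTorch.
import torch
import torch.nn as nn


class DualTowerEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, vocab_size=5000):
        super().__init__()
        # Speech tower: project acoustic features (e.g., 80-dim filterbanks)
        # and encode them with a Transformer encoder.
        self.speech_proj = nn.Linear(80, d_model)
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Text tower: embed character/token ids and encode them likewise.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Cross-modal interaction: acoustic queries attend over text
        # keys/values, pulling semantic cues into the speech representation.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, speech_feats, text_ids):
        speech_h = self.speech_encoder(self.speech_proj(speech_feats))
        text_h = self.text_encoder(self.text_embed(text_ids))
        fused, _ = self.cross_attn(query=speech_h, key=text_h, value=text_h)
        return fused  # semantics-aware acoustic representation


if __name__ == "__main__":
    # Dummy usage: 2 utterances, 120 frames of 80-dim features,
    # paired with 20-token transcripts.
    model = DualTowerEncoder()
    speech = torch.randn(2, 120, 80)
    text = torch.randint(0, 5000, (2, 20))
    print(model(speech, text).shape)  # torch.Size([2, 120, 256])
```

In such a design, the text tower is only needed when paired transcripts are available (e.g., during the pre-training stage described in the abstract); the fine-tuning stage would then adapt the network to inference conditions where only speech is provided.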
