Search Results - Nantong Library (南通市图书馆)

Using Speaker-Specific Emotion Representations in wav2vec 2.0-Based Modules for Speech Emotion Recognition

Computers, Materials & Continua, 2023, Vol. 77, No. 10, pp. 1009-1030

Authors: Somin Park, Mpabulungi Mark, Bogyung Park, Hyunki Hong. College of Software, Chung-Ang University, Seoul 06973, Korea; Department of AI, Chung-Ang University, Seoul 06973, Korea

Speech emotion recognition is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware *** properties of individuals' voices vary with culture, language, gender, and *** variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition (SER). This study demonstrates the significance of speaker-specific speech characteristics and how considering them can be leveraged to improve the performance of SER *** the proposed approach, two wav2vec-based modules (a speaker-identification network and an emotion classification network) are trained with the ArcFace *** speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific *** emotion classification network uses a wav2vec 2.0 backbone as well as four attention blocks to encode the same input audio waveform into an emotion *** two representations are then fused into a single vector representation containing emotion and speaker-specific *** results showed that the use of speaker-specific characteristics improves SER ***, combining these with an angular marginal loss such as the ArcFace loss improves intra-class compactness while increasing inter-class separability, as demonstrated by the plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods using similar training strategies, with a weighted accuracy (WA) of 72.14% and unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) *** demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech.
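
The abstract above reports both weighted accuracy (WA) and unweighted accuracy (UA). As a minimal sketch, assuming the convention common in SER work that WA is the overall fraction of correctly classified utterances and UA is the mean of the per-class recalls, the two metrics could be computed as follows. The function name `wa_ua` and the toy labels are illustrative only and are not taken from the paper.

```python
from collections import defaultdict


def wa_ua(y_true, y_pred):
    """Return (WA, UA) under the assumed definitions:
    WA = overall accuracy, UA = mean of per-class recalls."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    total = defaultdict(int)    # number of utterances per true class
    correct = defaultdict(int)  # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    wa = sum(correct.values()) / len(y_true)
    ua = sum(correct[c] / total[c] for c in total) / len(total)
    return wa, ua


# Toy, made-up labels (not IEMOCAP data): 3 angry, 1 happy, 2 sad utterances.
y_true = ["angry", "angry", "angry", "happy", "sad", "sad"]
y_pred = ["angry", "angry", "happy", "happy", "sad", "angry"]
wa, ua = wa_ua(y_true, y_pred)
print(f"WA={wa:.4f}, UA={ua:.4f}")
```

With imbalanced classes the two numbers diverge, which is one reason papers on IEMOCAP tend to report both.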

Keywords: Attention block; IEMOCAP dataset; speaker-specific representation; speech emotion recognition; wav2vec 2.0

Source: VIP Journal Database (维普期刊数据库)

Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases

Journal of Systems Science and Systems Engineering, 2024, Vol. 33, No. 5, pp. 576-606

Authors: Karim Dabbabi, Abdelkarim Mars. Research Unite of Analyse and Processing of Electrical and Energetic Systems, Faculty of Sciences of Tunis, Tunis El-Manar University, 2092 Tunis, Tunisia; Research Laboratory in Algebra, Numbers Theory and Intelligent Systems, Faculty of Sciences of Monastir, 90 Mohamed V street, 5000 Monastir, Tunisia

Existing pre-trained models like Distil HuBERT excel at uncovering hidden patterns and facilitating accurate recognition across diverse data types, such as audio and visual information. We harnessed this capability to develop a deep learning model that utilizes Distil HuBERT for jointly learning these combined features in speech emotion recognition (SER). Our experiments highlight its distinct advantages: it significantly outperforms wav2vec 2.0 in both offline and real-time accuracy on RAVDESS and BAVED datasets. Although slightly trailing HuBERT’s offline accuracy, Distil HuBERT shines with comparable performance at a fraction of the model size, making it an ideal choice for resource-constrained environments like mobile devices. This smaller size does come with a slight trade-off: Distil HuBERT achieved notable accuracy in offline evaluation, with 96.33% on the BAVED database and 87.01% on the RAVDESS database. In real-time evaluation, the accuracy decreased to 79.3% on the BAVED database and 77.87% on the RAVDESS database. This decrease is likely a result of the challenges associated with real-time processing, including latency and noise, but still demonstrates strong performance in practical scenarios. Therefore, Distil HuBERT emerges as a compelling choice for SER, especially when prioritizing accuracy over real-time processing. Its compact size further enhances its potential for resource-limited settings, making it a versatile tool for a wide range of applications.

Keywords: wav2vec 2.0; Distil HuBERT; HuBERT; SER; audio and audio-visual features

Source: VIP Journal Database (维普期刊数据库)

Experimental Design of Speech Emotion Recognition with a Multi-Task Teacher-Student Model (多任务师生模型的语音情感识别实验设计)

Experiment Science and Technology (实验科学与技术), 2024

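Authors: Sun Linhui (孙林慧), Li Ping'an (李平安), Lei Yunlong (雷云龙), Zhang Zixiao (张子晓). School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications (南京邮电大学通信与信息工程学院)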

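Addressing speech emotion recognition, a research hotspot in intelligent human-machine interaction, this work designs noisy speech emotion recognition based on a multi-task constrained teacher-student model as a research-oriented teaching experiment, in which students observe the guiding role of the teacher model, the learning process of the student model, and the constraining effect of the multi-level enhancement loss. A wav2vec 2.0-based teacher-student model and a multi-level enhancement loss mechanism are designed, and a speech enhancement auxiliary task is introduced into the student model so that, through learning, it acquires the feature representation capability of the teacher model. At test time, the student model extracts key emotional features directly from noisy speech for emotion classification. Finally, extensive experiments analyze the performance and robustness of the emotion recognition system. This teacher-student experimental design helps improve students' critical thinking, research innovation, and spirit of exploration.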

Keywords: speech emotion recognition; multi-task constraint; speech enhancement; wav2vec 2.0; teacher-student model

Source: Tongfang Journal Database (同方期刊数据库)

Self-Diffuser: Research on Speech-Driven Facial Expression Technology (Self-Diffuser:语音驱动人脸表情的技术研究)

Computer Science and Application (计算机科学与应用), 2024, Vol. 14, No. 8, pp. 236-249

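Authors: Zang Mengli (臧梦利), Wang Shaobo (王少波), Zhi Yu (智宇), Chen Ang (陈昂). Metaverse and Artificial Intelligence Research Center, College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang; Institute of Metaverse and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang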

Previous research on speech-driven facial expression animation has achieved realistic and accurate lip movements and facial expressions from audio signals. Traditional methods primarily focused on learning deterministic mappings from speech to animation. Recent studies have started exploring the diversity of speech-driven 3D facial animation, aiming to capture the complex many-to-many relationships between audio and facial motion by leveraging the diversity capabilities of diffusion models. In this study, the Self-Diffuser method is proposed, which uses the pre-trained large-scale language model wav2vec 2.0 to encode audio inputs and accomplishes the generation task by introducing diffusion-based techniques and combining them with a Transformer. This research not only overcomes the limitations of traditional regression models in generating lip movements that are both realistic and lip-reading comprehensible, but also explores the trade-off between precise lip synchronization and creating facial expressions independent of speech. Through comparison and analysis with current state-of-the-art methods, the Self-Diffuser method achieves more accurate lip movements in speech-driven facial animation. It also produces facial motions closer to real speaking expressions in the upper face region, which is only loosely correlated with speech. Additionally, the introduced diffusion mechanism significantly enhances the diversity of the generated 3D facial animation sequences.

Keywords: wav2vec 2.0; Transformer; diffusion mechanism; speech-driven facial animation

Source: VIP Journal Database (维普期刊数据库); Bokan Periodicals (博看期刊)
