A non-intrusive speech quality evaluation algorithm combining auxiliary target learning and convolutional recurrent network
Author affiliations: School of Information and Communication Engineering, Nanjing Institute of Technology, Nanjing 211167; School of Information Science and Engineering, Southeast University, Nanjing 210096
Publication: Chinese Journal of Acoustics (声学学报(英文版))
Year/Volume/Issue: 2023, Vol. 42, No. 2
Pages: 235-250
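Subject classification: 0711 [Science - Systems Science]; 07 [Science]; 08 [Engineering]; 081104 [Engineering - Pattern Recognition and Intelligent Systems]; 0811 [Engineering - Control Science and Engineering]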
Funding: Supported by the National Key Research and Development Program of China (2020YFC2004002, 2020YFC2004003), the National Natural Science Foundation of China (62001215), and the Scientific Research Fund Project of Nanjing Institute of Technology (CKJC202001).
Abstract: Objective evaluation of speech quality can replace expensive manual scoring, but current objective metrics usually require clean reference speech, which is difficult to obtain in many practical acoustic systems. A non-intrusive speech quality evaluation algorithm combining auxiliary target learning and a convolutional recurrent network (CRN) is proposed. Bark frequency cepstral coefficients (BFCCs), which are based on human-like auditory filters, are used as the input to the CRN to reduce network complexity. First, frame-level features are extracted from the BFCCs by a convolutional neural network (CNN). Then, long-term temporal dependencies and sequence features in the frame-level features are modeled by bidirectional long short-term memory (BiLSTM) networks. Finally, a self-attention mechanism is introduced into the CRN to adaptively extract useful information from the frame-level features; this information is integrated into sentence-level features and mapped to the final objective score. In addition, a multi-task training strategy is adopted, with voice activity detection (VAD) introduced as an auxiliary learning target to improve the performance of the algorithm. Experiments on public databases show that, compared with other non-intrusive algorithms, the proposed algorithm correlates better with the mean opinion score (MOS). Moreover, it has a small parameter size and generalizes well to the distorted speech database with MOS released by ITU-T P.808, where its accuracy is close to that of the perceptual evaluation of speech quality (PESQ).
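The last two stages described in the abstract, pooling frame-level features into a sentence-level representation and training against a MOS target plus an auxiliary VAD target, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The single learned `query` vector is a simplified stand-in for the paper's self-attention mechanism, and the function names, the squared-error and cross-entropy losses, and the `aux_weight` value are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse T frame-level vectors into one sentence-level vector.

    frames: list of T feature vectors, each of length D (for example,
            the per-frame outputs of a recurrent layer).
    query:  hypothetical learned vector of length D used to score frames.
    Returns the pooled vector and the per-frame attention weights.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


def multitask_loss(mos_pred, mos_true, vad_probs, vad_labels, aux_weight=0.5):
    """Main MOS regression loss plus a weighted auxiliary VAD loss.

    mos_pred, mos_true: predicted and labeled sentence-level scores.
    vad_probs:  per-frame predicted speech probabilities in (0, 1).
    vad_labels: per-frame ground truth, 1 for speech and 0 for non-speech.
    aux_weight: assumed weighting of the auxiliary task (illustrative).
    """
    mse = (mos_pred - mos_true) ** 2
    eps = 1e-7  # clamp probabilities to avoid log(0)
    bce = 0.0
    for p, y in zip(vad_probs, vad_labels):
        p = min(max(p, eps), 1.0 - eps)
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    bce /= len(vad_labels)
    return mse + aux_weight * bce


# Toy usage: three frames with two features each.
frames = [[0.0, 1.0], [2.0, 1.0], [0.0, 1.0]]
pooled, weights = attention_pool(frames, query=[1.0, 0.0])
loss = multitask_loss(3.2, 3.0, [0.9, 0.8, 0.1], [1, 1, 0])
```

In this sketch the frame that scores highest against `query` dominates the pooled vector, and the VAD term only adds a training signal; at inference time only the score prediction would be needed.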