Current deep-learning speech-enhancement methods generally operate on the magnitude spectrum of the speech signal in the frequency domain, so phase information is partly lost. To address this problem, a speech-enhancement method based on a time-domain fully convolutional network is proposed. The method processes the speech signal directly in the time domain with a fully convolutional neural network, preserving the signal's original phase information. Noisy speech and clean speech serve as the network's input and output, respectively, and a nonlinear mapping between them is learned in the time domain, achieving end-to-end speech enhancement. Simulation experiments show that the proposed time-domain fully convolutional method effectively improves speech quality under low signal-to-noise-ratio conditions.
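The end-to-end idea described above, a raw noisy waveform in and an enhanced waveform of the same length out, using only convolutional layers, can be sketched in pure Python. This is a minimal single-channel toy to illustrate the data flow, not the paper's architecture: the kernels below are illustrative and untrained, and a real network would use many channels, larger receptive fields, and weights learned from noisy/clean speech pairs.

```python
import math
import random

def conv1d(x, kernel, bias=0.0):
    """'Same' 1-D convolution with zero padding: output length equals input length."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) + bias
            for i in range(len(x))]

def relu(x):
    return [max(0.0, v) for v in x]

def fcn_enhance(noisy, layers):
    """Fully convolutional forward pass: each layer is a (kernel, bias) pair.
    Hidden layers use ReLU; the last layer is linear so the output waveform
    can take negative sample values."""
    h = noisy
    for kernel, bias in layers[:-1]:
        h = relu(conv1d(h, kernel, bias))
    kernel, bias = layers[-1]
    return conv1d(h, kernel, bias)

# Toy example: a clean sine corrupted by additive noise, passed through an
# (untrained) 3-layer network; the waveform shape is preserved end to end,
# so no magnitude/phase separation is ever needed.
clean = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
random.seed(0)
noisy = [c + random.gauss(0.0, 0.3) for c in clean]
layers = [([0.2, 0.6, 0.2], 0.0),    # smoothing-like illustrative kernels
          ([0.1, 0.8, 0.1], 0.0),
          ([0.25, 0.5, 0.25], 0.0)]
enhanced = fcn_enhance(noisy, layers)
assert len(enhanced) == len(noisy)   # end-to-end: waveform in, waveform out
```

Because every layer is a convolution, the network has no fixed input length: the same weights apply to utterances of any duration, which is what makes the fully convolutional, time-domain formulation attractive for end-to-end enhancement.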
2022, 44(15): 139-144    Received: 2021-08-12
DOI:10.3404/j.issn.1672-7649.2022.15.029
CLC number: TN912.35
Funding: National Natural Science Foundation of China (61771483)
About the author: LI Wenzhi (b. 1996), male, master's student; research interest: digital signal processing