GE Wanying, ZHANG Tianqi
Abstract: Monaural speech enhancement algorithms obtain enhanced speech by estimating and suppressing the noise components in noisy speech. Noise estimation algorithms, however, tend to over-estimate, so that part of the estimated noise energy exceeds the actual value. Although the over-estimated values can be removed by compensation, the error this introduces also degrades the overall quality of the enhanced speech. To address this problem, a time-frequency mask estimation and optimization algorithm based on Computational Auditory Scene Analysis (CASA) is proposed. Firstly, the Decision Directed (DD) algorithm is used to estimate the a priori Signal-to-Noise Ratio (SNR) and calculate the initial mask. Secondly, the Inter-Channel Correlation (ICC) coefficient between the noise and the noisy speech in each Gammatone filterbank channel is used to calculate the noise presence probability, which is combined with the energy spectrum of the noisy speech to obtain a new noise estimate with fewer over-estimated components. Thirdly, the initial mask is iterated by an optimization algorithm to reduce the error caused by noise over-estimation and to increase the target speech components in the mask; the iteration stops when the stopping conditions are met, giving the new mask. Finally, the enhanced speech is synthesized with the new mask. Experimental results show that, under different background noises and compared with the mask before optimization, the new mask yields higher Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) values for the enhanced speech, improving both its listening quality and its intelligibility.
Key words: computational auditory scene analysis; speech enhancement; time-frequency mask; noise estimation; mask optimization; speech intelligibility
CLC number: TN912.35
Document code: A
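0 Introduction
As a front-end processing technique, speech enhancement aims to extract the target speech from speech corrupted by noise. According to the number of receiving microphones, speech enhancement methods are divided into single-channel and multi-channel methods. Compared with multi-channel methods, single-channel methods are low in cost and easy to implement, and they are widely used in communications, speech recognition and other fields. Traditional single-channel methods include spectral subtraction [1], Wiener filtering [2] and subspace algorithms [3].
In recent years, by imitating the way the human ear processes sound signals, researchers have proposed Computational Auditory Scene Analysis (CASA), in which the Gammatone filterbank is an auditory model that simulates the human cochlea. Speech signals processed by the filterbank give better results than traditional approaches. CASA-based speech enhancement algorithms usually construct a mask that separates the target speech from the background noise according to features such as the pitch period [4], and then obtain the enhanced speech signal. Single-channel speech enhancement algorithms must estimate the noise energy; because noise is random, the estimate is prone to over-estimation, which lowers the overall quality of the enhanced speech [6-7]. Reference [6] exploits the nonlinear frequency characteristics of the Gammatone filterbank: it computes the cross-correlation coefficient between the noise and the noisy speech in each channel of the filterbank, reduces the over-estimated components of the estimated noise, and then obtains an estimate of the speech energy spectrum iteratively with a convex optimization algorithm. However, after the speech energy spectrum is obtained, that algorithm still needs a clustering step, and the enhanced speech is recovered with the resulting mask. Limited by the accuracy of the clustering, the recovered enhanced speech usually falls short in listening quality and intelligibility.
To address these problems, this paper proposes a time-frequency mask estimation and optimization algorithm that combines the Decision Directed (DD) algorithm [8] with the Inter-Channel Correlation (ICC) coefficient [6]. First, an initial mask estimate is obtained with the DD algorithm. Next, the cross-correlation coefficient between the noise and the noisy speech is computed in each channel to give the noise presence probability. Then, an objective function is determined from the properties of the mask and, together with the results of the first two steps, an optimization algorithm reduces the error in the initial mask. Finally, the new mask is used to remove the noise signal from the noisy speech, giving the enhanced speech.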
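1 Principle of speech enhancement
In general, noisy speech is the sum of speech and additive noise:
After Gammatone filterbank filtering, the sound signal is split into 64 channels of different bandwidths; the center frequency and bandwidth of each channel are determined by the Equivalent Rectangular Bandwidth (ERB) method [9]. The signal in each channel is windowed and framed to give a sequence of time-frequency units, and the energy of each time-frequency unit is computed to give the energy spectrum of the signal [4]. Assuming that the noise and the target speech are independent, the energy of the filtered signal in the time-frequency unit at time frame t and channel center frequency f is expressed as: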
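The channel layout and the per-unit energy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the Glasberg and Moore ERB fit, a 50 Hz to 8000 Hz frequency range, and a Hann window, none of which are stated in the text. The 64 channels, 20 ms frames (320 samples at 16 kHz) and 50% overlap follow the experimental setup. The function names are hypothetical, and `subbands` stands for the output of a Gammatone filterbank that is not implemented here.

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate scale (Glasberg and Moore fit, assumed)."""
    return 21.4 * np.log10(4.37e-3 * np.asarray(f_hz, dtype=float) + 1.0)

def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (np.asarray(erb_rate, dtype=float) / 21.4) - 1.0) / 4.37e-3

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at center frequency f_hz."""
    return 24.7 * (4.37e-3 * np.asarray(f_hz, dtype=float) + 1.0)

def center_frequencies(n_channels=64, f_low=50.0, f_high=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale.

    f_low and f_high are assumptions; the text only fixes n_channels = 64.
    """
    rates = np.linspace(hz_to_erb_rate(f_low), hz_to_erb_rate(f_high), n_channels)
    return erb_rate_to_hz(rates)

def tf_unit_energy(subbands, frame_len=320, hop=160):
    """Energy of each time-frequency unit.

    subbands: array of shape (channels, samples) holding filterbank outputs.
    Returns an array of shape (channels, frames).
    """
    subbands = np.atleast_2d(np.asarray(subbands, dtype=float))
    n_frames = 1 + (subbands.shape[1] - frame_len) // hop
    window = np.hanning(frame_len)  # window type is an assumption
    energy = np.empty((subbands.shape[0], n_frames))
    for t in range(n_frames):
        frame = subbands[:, t * hop : t * hop + frame_len] * window
        energy[:, t] = np.sum(frame ** 2, axis=1)
    return energy
```

Because the spacing is uniform on the ERB-rate scale, the channels are narrow and dense at low frequencies and wide and sparse at high frequencies, which is the nonlinear frequency resolution the Gammatone model is meant to provide.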
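CASA-based speech enhancement synthesizes the time-domain enhanced speech from a mask and the noisy speech [5-6,10]. The Ideal Binary Mask (IBM) is a commonly used mask whose value is 1 when speech energy dominates the current time-frequency unit and 0 otherwise. Enhanced speech obtained with the IBM retains the parts dominated by the target speech and removes the rest. Another mask is the Ideal Ratio Mask (IRM), which takes values between 0 and 1, with larger values in speech parts than in noise parts.
Compared with the IBM, enhanced speech obtained with the IRM preserves weak speech components embedded in noise and has higher speech quality. The mask computed in this paper is therefore the ideal ratio mask, given by: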
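The difference between the two masks can be illustrated with a short sketch. It assumes the widely used energy-ratio form of the IRM, S/(S+N), and a 0 dB local criterion for the IBM; the paper's own formula may differ (for example by an exponent), and the function names are hypothetical.

```python
import numpy as np

def ideal_binary_mask(speech_energy, noise_energy, threshold_db=0.0):
    """IBM: 1 where the local SNR exceeds threshold_db, 0 elsewhere."""
    s = np.asarray(speech_energy, dtype=float)
    n = np.asarray(noise_energy, dtype=float)
    eps = np.finfo(float).tiny  # avoids log of zero and division by zero
    local_snr_db = 10.0 * np.log10((s + eps) / (n + eps))
    return (local_snr_db > threshold_db).astype(float)

def ideal_ratio_mask(speech_energy, noise_energy):
    """IRM in the assumed energy-ratio form S / (S + N), with values in [0, 1]."""
    s = np.asarray(speech_energy, dtype=float)
    n = np.asarray(noise_energy, dtype=float)
    total = s + n
    # Units with no energy at all get a mask value of 0.
    return np.divide(s, total, out=np.zeros_like(s), where=total > 0)

# Speech and noise energies for a 2-channel, 2-frame toy example.
S = np.array([[4.0, 1.0], [0.0, 3.0]])
N = np.array([[1.0, 4.0], [2.0, 1.0]])
ibm = ideal_binary_mask(S, N)
irm = ideal_ratio_mask(S, N)
```

In the unit where S = 1 and N = 4, the IBM is 0 and the speech is discarded entirely, while the IRM is 0.2 and a fraction of the energy is kept. This is the sense in which the ratio mask preserves weak speech embedded in noise.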
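2 Mask estimation and optimization
2.1 Overall framework
The overall framework of the proposed algorithm is shown in Fig. 1.
The algorithm consists of two parts: mask estimation and mask optimization. In the mask estimation part, the noise is estimated and the a posteriori SNR is computed, the a priori SNR is then obtained by maximum likelihood estimation, and the initial mask is computed. In the mask optimization part, the noisy speech and the estimated noise signal, after passing through the Gammatone filterbank, are framed and windowed and then transformed with the discrete Fourier transform, and the cross-correlation coefficient between the noisy speech and the noise is computed in each channel. To correct the effect of noise over-estimation on the initial mask, the cross-correlation coefficient is taken as the noise presence probability and combined with the noisy speech to form the optimization target. The initial mask is iterated toward this target, which reduces the bias caused by over-estimation while increasing the target speech components contained in the mask. Finally, the optimized mask is used to synthesize the enhanced speech.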
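The correlation step of the optimization stage can be sketched as below. This is an illustration under stated assumptions rather than the formula of the paper or of Ref. [6]: it uses an uncentered normalized correlation between the power spectra of the two frames, a Hann window, and 320-sample frames with a 160-sample hop. The function name and argument layout are hypothetical, and both inputs are assumed to be Gammatone filterbank outputs of shape (channels, samples).

```python
import numpy as np

def inter_channel_correlation(noisy_subbands, noise_subbands,
                              frame_len=320, hop=160, eps=1e-12):
    """Per-channel, per-frame correlation between noisy speech and estimated noise.

    Returns an array of shape (channels, frames) with values in [0, 1],
    which the text interprets as a noise presence probability.
    """
    noisy = np.atleast_2d(np.asarray(noisy_subbands, dtype=float))
    noise = np.atleast_2d(np.asarray(noise_subbands, dtype=float))
    if noisy.shape != noise.shape:
        raise ValueError("noisy and noise sub-band arrays must have the same shape")
    n_channels, n_samples = noisy.shape
    n_frames = 1 + (n_samples - frame_len) // hop
    window = np.hanning(frame_len)  # window type is an assumption
    rho = np.zeros((n_channels, n_frames))
    for t in range(n_frames):
        sl = slice(t * hop, t * hop + frame_len)
        # Power spectra of the windowed frames in every channel.
        p_noisy = np.abs(np.fft.rfft(noisy[:, sl] * window, axis=1)) ** 2
        p_noise = np.abs(np.fft.rfft(noise[:, sl] * window, axis=1)) ** 2
        num = np.sum(p_noisy * p_noise, axis=1)
        den = np.sqrt(np.sum(p_noisy ** 2, axis=1) * np.sum(p_noise ** 2, axis=1))
        rho[:, t] = num / (den + eps)
    return rho
```

A value near 1 means the frame of noisy speech looks spectrally like the noise estimate, so the unit is probably noise-dominated; a value near 0 means the two spectra barely overlap. How this probability is combined with the noisy-speech energy to form the new noise estimate is not reproduced in this sketch.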
2.2 Mask estimation
From Eqs. (2)-(3), the relation between the mask and the speech energy is:
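The mask estimation stage can be illustrated with the classical decision-directed rule. This is a stand-in, since the paper builds on the improved DD framework of Ref. [8], whose details are not shown here. The sketch assumes an energy-domain mask of the form ξ/(1+ξ); the smoothing factor `alpha`, the floor `xi_min` and the function name are hypothetical choices, not the paper's parameters.

```python
import numpy as np

def dd_initial_mask(noisy_energy, noise_energy, alpha=0.98, xi_min=1e-3):
    """Classical decision-directed a priori SNR estimate and a ratio-type mask.

    noisy_energy, noise_energy: arrays of shape (channels, frames).
    Returns (xi, mask), both of shape (channels, frames).
    """
    y = np.asarray(noisy_energy, dtype=float)
    n = np.maximum(np.asarray(noise_energy, dtype=float), 1e-12)
    gamma = y / n                      # a posteriori SNR
    ml = np.maximum(gamma - 1.0, 0.0)  # maximum likelihood estimate of the a priori SNR
    xi = np.empty_like(gamma)
    mask = np.empty_like(gamma)
    for t in range(gamma.shape[1]):
        if t == 0:
            xi_t = ml[:, 0]
        else:
            # mask * gamma at frame t-1 is the previous speech energy estimate
            # divided by the previous noise energy.
            xi_t = alpha * mask[:, t - 1] * gamma[:, t - 1] + (1.0 - alpha) * ml[:, t]
        xi[:, t] = np.maximum(xi_t, xi_min)
        mask[:, t] = xi[:, t] / (1.0 + xi[:, t])
    return xi, mask
```

The recursion smooths the a priori SNR over time, which suppresses musical noise but also means that an over-estimated noise energy lowers both terms and pulls the mask down. That bias is what the optimization stage is intended to reduce.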
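2.3.3 Mask optimization
Because speech energy ranges over (0, +∞) and the energy values differ greatly between time-frequency units, computing Ŝ(t, f) at each iteration is very expensive. In addition, to remove the influence of clustering accuracy and of the binary mask on the algorithm, this paper replaces the energy values in Eq. (14) with ratio mask values as the optimization target: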
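3 Experiments and analysis of results
3.1 Experimental parameters and evaluation metrics
The simulation experiments use speech signals from the TIMIT database [14]. The signals are sampled at 16 kHz with 16-bit quantization and last about 2 s. The noises are taken from the NOISEX-92 database [15]: Babble, Engine and White noise. The proposed algorithm is tested at input SNRs from -5 dB to 5 dB in steps of 1 dB.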
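Test signals at a prescribed input SNR are usually produced by scaling the noise before adding it to the clean speech. The sketch below shows one common way to do this, assuming the SNR is defined over the whole utterance; the paper does not describe its mixing procedure, and the function name is hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that the speech-to-noise power ratio equals snr_db, then add.

    Returns (noisy, scaled_noise). The noise must be at least as long as the speech.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if noise.size < speech.size:
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: speech.size]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return speech + scaled_noise, scaled_noise

# The eleven input SNRs used in the experiments: -5 dB to 5 dB in 1 dB steps.
input_snrs_db = list(range(-5, 6))
```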
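The experiments use a 4th-order, 64-channel Gammatone filterbank, with 20 ms frames and 50% frame overlap. The parameters are: a = -2, c = 2.7 and ζ = 0.015 in Eq. (10); Gmin = 0.178 in Eq. (12); λ = 0.02 in Eq. (19); μ = 0.01 in Eq. (22); θ = 0.3 in Eq. (23); n1 = 1 and n2 = 1 in Eq. (24).
This paper uses the algorithm of Ref. [16] to estimate the noise signal in the time domain, and takes that algorithm and the algorithm of Ref. [6] as the comparison algorithms. Besides the segmental Signal-to-Noise Ratio (segSNR), the evaluation metrics are Perceptual Evaluation of Speech Quality (PESQ) [17] and Short-Time Objective Intelligibility (STOI) [18]. segSNR is the average of the SNR computed for each frame of the signal; a higher value indicates better noise suppression. PESQ represents the subjective listening quality of the enhanced speech; a higher score indicates better listening quality. STOI reflects the degree of distortion of the enhanced speech; a larger value indicates less distortion introduced by the algorithm and higher intelligibility.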
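The segSNR definition above (per-frame SNR followed by averaging) can be sketched as follows. The clamping of each frame to the range -10 dB to 35 dB is a common convention and an assumption here, as are the frame length of 320 samples, the hop of 160 samples and the function name. PESQ and STOI are standardized measures and are not reimplemented.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=320, hop=160,
                  min_db=-10.0, max_db=35.0, eps=1e-12):
    """Average of per-frame SNRs in dB between a clean and an enhanced signal."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    if clean.shape != enhanced.shape:
        raise ValueError("clean and enhanced must have the same shape")
    n_frames = 1 + (clean.size - frame_len) // hop
    frame_snrs = []
    for t in range(n_frames):
        sl = slice(t * hop, t * hop + frame_len)
        signal_energy = np.sum(clean[sl] ** 2)
        error_energy = np.sum((clean[sl] - enhanced[sl]) ** 2)
        snr_db = 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
        # Clamp so that silent or perfectly matched frames do not dominate the mean.
        frame_snrs.append(np.clip(snr_db, min_db, max_db))
    return float(np.mean(frame_snrs))
```

Because segSNR only measures waveform error frame by frame, a mask that keeps more residual noise in order to protect weak speech can score lower on segSNR while still scoring higher on PESQ and STOI.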
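3.2 Results and analysis
Tables 1-2 give the average PESQ and STOI values before and after mask optimization under the three background noises. The comparison shows that mask optimization improves both the listening quality and the intelligibility of the enhanced speech, with the most pronounced gains in both metrics under Engine noise.
Table 3 gives the average segSNR obtained with the two masks. Under Babble and White noise, the segSNR after mask optimization is lower than before optimization; that is, the proposed mask optimization algorithm does not suppress noise effectively. The reason is that, although the estimated inter-channel correlation coefficient resembles the true value (see Fig. 4), their ranges differ in the low-frequency part, so the computed noise presence probability is always greater than the actual probability. Consequently, compared with the initial mask, the optimized mask retains more noise components when the enhanced speech is synthesized.
Comparing the PESQ values of the three algorithms in Table 4, the proposed algorithm obtains higher PESQ values than the comparison algorithms, except under Babble noise, where its result is lower than that of Ref. [6]. This is because Ref. [6] uses a mask that combines the IBM and the IRM, with the IBM making up the larger share of the final mask, which lets that algorithm remove excess noise components under babble-like noise and give better subjective listening quality. In the other noise environments, the proposed algorithm gives better subjective listening quality than the comparison algorithms.
Table 6 gives the segSNR values of the three algorithms. The algorithm of Ref. [16] has the highest noise suppression performance, while the algorithm of Ref. [6] and the proposed algorithm, both of which use the correlation coefficient in Gammatone channels, obtain lower segSNR values. One reason is the discrepancy between the inter-channel correlation coefficient and its true value. Another is that the improved DD algorithm does not distinguish the low-frequency and high-frequency components of speech, and its estimate of prop is inaccurate when the instantaneous SNR is low, so the initial mask still retains some noise components. Solving this problem is part of the next stage of this research. Nevertheless, speech enhancement algorithms place more weight on improving listening quality and intelligibility than on noise suppression [6,19], and according to the PESQ and STOI results the proposed algorithm outperforms the comparison algorithms in both respects.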
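4 Conclusion
To address the problem that over-estimation in traditional noise estimation algorithms degrades the overall quality of the enhanced speech in single-channel speech enhancement, this paper proposes a single-channel speech enhancement algorithm based on time-frequency mask estimation and optimization. After obtaining the initial mask, the algorithm uses iterative optimization to increase the target speech components in it. Experimental results show that, although the algorithm does not improve the noise suppression performance of the initial mask, it improves markedly on the comparison algorithms in the other two key metrics (PESQ and STOI), indicating that it effectively improves the listening quality and intelligibility of the enhanced speech.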
References
[1] CAO L, ZHANG T Q, GAO H X, et al. Multi-band spectral subtraction method for speech enhancement based on masking property of human auditory system[J]. Computer Engineering and Design, 2013, 34(1): 235-240. (in Chinese)
[2] LI J B, MA Y B, XIA J, et al. An improved Wiener filtering speech enhancement algorithm based on modified cepstrum smooth technology[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2016, 28(4): 462-467. (in Chinese)
[3] BOROWICZ A, PETROVSKY A. Signal subspace approach for psychoacoustically motivated speech enhancement[J]. Speech Communication, 2011, 53(2): 210-219.
[4] HU K, WANG D. Unvoiced speech segregation from nonspeech interference via CASA and spectral subtraction[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(6): 1600-1609.
[5] WANG Y, NARAYANAN A, WANG D, et al. On training targets for supervised speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12): 1849-1858.
[6] BAO F, ABDULLA W H. Noise masking method based on an effective ratio mask estimation in Gammatone channels[J]. APSIPA Transactions on Signal and Information Processing, 2018, 7(e5): 1-12.
[7] SUN M, LI Y, GEMMEKE J F, et al. Speech enhancement under low SNR conditions via noise estimation using sparse and low-rank NMF with Kullback-Leibler divergence[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(7): 1233-1242.
[8] NAHMA L, YONG P C, DAM H H, et al. Convex combination framework for a priori SNR estimation in speech enhancement[C]// Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2017: 4975-4979.
[9] JIANG Y, LIU R S, FENG Z M. Dual-microphone speech enhancement algorithm based on the auditory features for a close-talk system[J]. Journal of Tsinghua University (Science and Technology), 2014, 54(9): 1179-1183. (in Chinese)
[10] BAO F, ABDULLA W H. A new ratio mask representation for CASA-based speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2019, 27(1): 7-19.
[11] YONG P C, NORDHOLM S, DAM H H, et al. On the optimization of sigmoid function for speech enhancement[C]// Proceedings of the 19th European Signal Processing Conference. Piscataway, NJ: IEEE, 2011: 211-215.
[12] CHEN Z, HOHMANN V. Online monaural speech enhancement based on periodicity analysis and a priori SNR estimation[J]. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015, 23(11): 1904-1916.
[13] ZHENG C, TAN Z, PENG R, et al. Guided spectrogram filtering for speech dereverberation[J]. Applied Acoustics, 2018, 134(5): 154-159.
[14] GAROFOLO J S, LAMEL L F, FISHER W M, et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus[EB/OL]. [2019-01-12]. https://catalog.ldc.upenn.edu/LDC93S1.
[15] VARGA A, STEENEKEN H J M. Assessment for automatic speech recognition II: NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems[J]. Speech Communication, 1993, 12(3): 247-251.
[16] GERKMANN T, HENDRIKS R C. Unbiased MMSE-based noise power estimation with low complexity and low tracking delay[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(4): 1383-1393.
[17] International Telecommunications Union (ITU). Perceptual Evaluation of Speech Quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs[EB/OL]. [2019-01-12]. https://www.itu.int/rec/T-REC-P.862-200102-I/en.
[18] TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time-frequency weighted noisy speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125-2136.
[19] LOIZOU P C, KIM G. Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(1): 47-56.
This work is partially supported by the National Natural Science Foundation of China (61671095, 61702065, 61701067, 61771085), the Project of Key Laboratory of Signal and Information Processing of Chongqing (CSTC2009CA2003), the Chongqing Graduate Research and Innovation Project (CYS17219), and the Research Project of Chongqing Educational Commission (KJ1600427, KJ1600429).
GE Wanying, born in 1994, M. S. candidate. His research interests include signal processing, speech enhancement.
ZHANG Tianqi, born in 1971, Ph. D., professor. Her research interests include spread spectrum communications, blind signal processing, speech signal processing.