Refinement of Enhanced Speech Using Hybrid-Median Filter and Harmonic Regeneration Approach
Ching-Ta Lu, Jun-Hong Shen, Kun-Fu Tseng, and Chih-Tsung Chen
Abstract—This study proposes a post-processor that improves the harmonic structure of vowels in enhanced speech, thereby improving the speech quality. Initially, a speech enhancement algorithm is employed to reduce the background noise in noisy speech. The enhanced speech is then post-processed by a hybrid-median filter to reduce the musical effect of residual noise. Because the harmonic spectra are degraded by background noise and by the speech enhancement process itself, the quality of vowels deteriorates. A harmonic regeneration method is therefore developed to improve the quality of the post-processed speech. Experimental results show that the proposed method improves the quality of post-processed speech by adequately regenerating the harmonic spectra.
Index Terms—Harmonic regeneration, hybrid-median filter, noise reduction, post-processing, speech enhancement.
1. Introduction
The goal of a speech enhancement system is to remove the background noise while preserving the clean speech as much as possible. Many speech enhancement algorithms have been proposed to reduce the background noise in noisy environments [1]-[3]. These algorithms attempt to remove background noise efficiently, but the musical effect of residual noise remains apparent in the enhanced speech. This musical noise is very annoying to the human ear. Accordingly, many studies have devoted great effort to reducing this noise [4]-[6].
Lu [6] proposed a method using a three-step gain factor and an iterative-directional-median filter to reduce the background and musical residual noise. In the first stage, a pre-processor enhanced noisy speech using an algorithm which combined the two-step-decision-directed and Virag [1] methods. In the second stage, the enhanced speech was post-processed by an iterative-directional-median filter to significantly reduce the quantity of residual noise while maintaining the harmonic spectra. Esch and Vary [5] proposed smoothing the weighting gains under speech-pause and low-SNR (signal-to-noise ratio) conditions, yielding a reduction of the musical effect of residual noise. Jo and Yoo [3] considered a psycho-acoustically constrained and distortion-minimized enhancement algorithm. This algorithm minimized speech distortion while the sum of speech distortion and residual noise was kept below the masking threshold.
Based on the above findings, those post-processors can efficiently remove residual noise. However, the harmonic spectra deteriorated by background noise and a speech enhancement system cannot be recovered. Since the harmonic spectra dominate the speech quality of a vowel, we attempt to reconstruct the weak harmonic spectra deteriorated by the background noise and the speech enhancement process, enabling the speech quality to be improved. Initially, we employ a speech enhancement system [1] as the first stage to remove background noise while keeping speech distortion at a low level. The output signal is further processed by the hybrid-median filter [4], efficiently reducing the musical effect of residual noise. In addition, a harmonic regeneration algorithm is developed to reconstruct the deteriorated harmonic spectra of a vowel. Finally, the hybrid-median filtered spectrum and the harmonic regenerated spectrum are mixed to form the post-processed spectrum. Musical tones are therefore significantly smoothed by the hybrid-median filter, enabling the filtered speech to sound much less annoying than the pre-processed speech. In addition, the deteriorated harmonic spectra are adequately regenerated in the post-processed speech, so the speech quality is improved. Experimental results show that the proposed post-processor can improve the performance of a speech enhancement system by efficiently removing the musical effect of residual noise, while the speech quality is improved by the regeneration of harmonic spectra.
The rest of this paper is organized as follows. Section 2 introduces the proposed speech enhancement system. Section 3 presents a hybrid-median filter. Section 4 describes the method of harmonic regeneration. Section 5 shows the experimental results. Conclusions are finally drawn in Section 6.
2. Proposed Speech Enhancement System
The block diagram of the proposed speech enhancement system is shown in Fig. 1. Initially, the noisy speech is framed by a Hanning window, and then transformed into the frequency domain by the fast Fourier transform (FFT). A minimum statistics algorithm [7] is employed to estimate the noise magnitude for each subband. This noise estimate is then employed to adapt the speech enhancement system, enabling the background noise to be efficiently removed.
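The analysis front-end described above (Hanning windowing followed by an FFT per frame) can be sketched as follows; the frame length, hop size, and sampling rate are illustrative choices, not values taken from the paper.

```python
import numpy as np

def stft_frames(x, frame_len=256, hop=128):
    """Split a signal into Hanning-windowed frames and take the FFT
    of each frame. Frame length and hop size are assumed values."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len), dtype=complex)
    for m in range(n_frames):
        frame = x[m * hop : m * hop + frame_len] * window
        spectra[m] = np.fft.fft(frame)
    return spectra

# Example: one second of a 1 kHz tone sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)
S = stft_frames(x)
print(S.shape)  # (61, 256): frames x frequency bins
```

The per-frame spectra produced this way feed both the noise estimator and the subsequent filtering stages; synthesis uses the inverse FFT with overlap-add.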
Fig. 1. Block diagram of proposed speech enhancement system.
Because the musical effect of residual noise is apparent in the pre-processed speech, the hybrid-median filter [4] is employed to remove this noise. In addition, the noisy speech is utilized to estimate a pitch period, from which the fundamental frequency is estimated. The robust harmonic spectra are then searched for in a vowel frame. The number of robust harmonics is employed to adapt the speech-presence probability, which in turn adapts the hybrid-median filter.
The robust harmonics are also employed to create a comb filter for a vowel frame. The noisy speech is filtered by the comb filter to regenerate harmonic spectra. Then the hybrid-median filtered spectra and the regenerated harmonic spectra are mixed to produce the post-processed spectrum. Finally, the inverse FFT is performed to obtain the post-processed speech.
3. Hybrid-Median Filter
Directional median filtering [6] is performed when a frame has a strong harmonic structure. The direction candidates are shown in Fig. 2, where the center spectrum is denoted by a filled circle.
Fig. 2. Motion directions of the center spectrum.
If the value of the speech-presence probability exceeds a given threshold, the center spectrum is classified as a vowel and kept unchanged to maintain the speech quality. On the other hand, if the value of the speech-presence probability lies between 0.2 and 0.8, the center spectrum is classified as vowel-like and filtered by a directional median filter, given as
where i* denotes the optimum direction, and S̃(ω, m) is the pre-processed spectrum.
A center spectrum is classified as vowel-like when the number of robust harmonics is great enough. We further check whether the center spectrum is a vowel by the speech-presence probability [8]. As shown in Fig. 2, the optimum motion direction of the center spectrum is selected among three candidate directions (1-3). The decision rule is to find the minimum spectral distance among the three directions. The spectral-distance measure can be expressed by
where i denotes the direction index of the center spectrum, i.e., 1 ≤ i ≤ 3.
The direction giving the minimum of the spectral-distance measure in (2) is declared the optimum motion direction for the center spectrum. The optimum distance measure is given as
The directional-median filter can mitigate the fluctuation of random spectral peaks, enabling the musical effect of residual noise to be reduced. To further improve the reduction of musical tones, we employ a block-median filter to significantly smooth the variation of musical tones when a center spectrum is classified as noise-like. The larger the window size, the greater the reduction of spectral variation. However, increasing the window size introduces speech distortion. Therefore, we adopt a 3×3 window to analyze and filter the pre-processed spectra.
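The hybrid-median filtering logic of this section can be sketched as follows; the probability thresholds, the spectral-distance rule, and the direction set are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def hybrid_median_filter(S, p, vowel_thr=0.8, noise_thr=0.2):
    """Sketch of the hybrid-median post-filter on a magnitude
    spectrogram S (frames x bins), driven by a speech-presence
    probability map p. Thresholds and decision rules are assumed."""
    out = S.copy()
    M, W = S.shape
    # Three candidate directions through the center bin:
    # along time, along frequency, and along the diagonal.
    dirs = [(0, 1), (1, 0), (1, 1)]
    for m in range(1, M - 1):
        for w in range(1, W - 1):
            if p[m, w] > vowel_thr:
                continue  # vowel: keep the center spectrum unchanged
            if p[m, w] > noise_thr:
                # vowel-like: median along the minimum-distance direction
                def dist(d):
                    dm, dw = d
                    return (abs(S[m - dm, w - dw] - S[m, w])
                            + abs(S[m + dm, w + dw] - S[m, w]))
                dm, dw = min(dirs, key=dist)
                out[m, w] = np.median(
                    [S[m - dm, w - dw], S[m, w], S[m + dm, w + dw]])
            else:
                # noise-like: 3x3 block median smooths musical tones
                out[m, w] = np.median(S[m - 1:m + 2, w - 1:w + 2])
    return out
```

Vowel bins pass through untouched, vowel-like bins are smoothed only along the direction that best follows the harmonic track, and noise-like bins receive the strongest (block) smoothing.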
4. Harmonic Regeneration
4.1 Robust Harmonic Estimation
The fundamental frequency of speech lies in the range from 50 Hz to 500 Hz. We can low-pass filter the noisy speech with a cut-off frequency of 500 Hz to obtain a low-pass signal, which allows the pitch period to be estimated accurately by reducing the interference of high-frequency components. The pitch period can then be estimated by a weighted autocorrelation function, given as [9]
where ε is a very small value that prevents the denominator from being zero, AMDF denotes the average magnitude difference function, and τ represents a time shift.
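A minimal sketch of this pitch estimator, following the idea of [9]: the autocorrelation is divided by the AMDF plus a small ε guarding the denominator, and the lag search is restricted to the 50 Hz to 500 Hz range mentioned above. Details such as the exact weighting may differ from [9].

```python
import numpy as np

def pitch_weighted_autocorr(x, fs, fmin=50.0, fmax=500.0, eps=1e-6):
    """Estimate the pitch period (in samples) by maximising the
    autocorrelation divided by the AMDF over lags in [fs/fmax, fs/fmin]."""
    n = len(x)
    tau_min = int(fs / fmax)
    tau_max = int(fs / fmin)
    best_tau, best_score = tau_min, -np.inf
    for tau in range(tau_min, min(tau_max, n - 1) + 1):
        r = np.dot(x[:n - tau], x[tau:])               # autocorrelation
        amdf = np.mean(np.abs(x[:n - tau] - x[tau:]))  # AMDF
        score = r / (amdf + eps)
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau

# Example: a 200 Hz tone at fs = 8 kHz has a period of 40 samples
fs = 8000
x = np.sin(2 * np.pi * 200 * np.arange(800) / fs)
print(pitch_weighted_autocorr(x, fs))  # ~40
```

Dividing by the AMDF sharpens the peak at the true lag, since the AMDF dips toward zero exactly where the autocorrelation peaks.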
Harmonic estimation can be performed using the fundamental frequency ω0, which is obtained from the pitch period T0, given as
where N denotes the frame size.
In the experiments, we find that the fundamental frequency estimated by (5) tends to be underestimated. Thus we shift the location of the fundamental frequency ω0 to that of the spectral peak for each segment. The shifted frequency can be expressed as
where il and el represent the starting and ending frames of the lth segment, and ω0′(m) denotes the fundamental frequency aligned with the spectral peak.
Robust harmonics occur at multiples of the fundamental frequency, i.e., nω0. The number of robust harmonics K can be decided by
Observing (8), if the frequency offset between two adjacent harmonics varies heavily, the harmonic structure may become weak. Thus the boundary of the robust harmonics can be marked.
4.2 Harmonic Spectrum Regeneration
According to (5)-(8), we can find the center frequency of each robust harmonic. The center frequencies of the remaining harmonics, at frequencies greater than KF0, can be obtained by duplication. A comb filter can be obtained as [10]
where ωc denotes the center frequency of an estimated harmonic. The quantity Bc specifies the filter gain at the harmonic frequency ωc. The value of Bv determines the filter response outside the harmonic frequency range, and σ controls the width of the comb filter (σ = 1).
Filtering the noisy speech by the comb filter yields enhanced harmonics. Hence, a harmonic regenerated signal can be obtained by
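A sketch of frequency-domain comb filtering for harmonic regeneration, assuming a Gaussian-shaped gain of height Bc around each harmonic bin, a floor gain Bv elsewhere, and width σ; this parametrisation is an illustrative stand-in for the exact filter of [10].

```python
import numpy as np

def comb_filter_gain(n_bins, f0_bin, num_harmonics, Bc=1.0, Bv=0.1, sigma=1.0):
    """Build a comb-filter gain over n_bins frequency bins: gain Bc at
    each harmonic bin (multiples of f0_bin) with a Gaussian roll-off of
    width sigma, and floor gain Bv away from the harmonics. The shape
    and parameter values are illustrative assumptions."""
    bins = np.arange(n_bins)
    gain = np.full(n_bins, Bv)
    for k in range(1, num_harmonics + 1):
        center = k * f0_bin
        if center >= n_bins:
            break
        bump = Bv + (Bc - Bv) * np.exp(-0.5 * ((bins - center) / sigma) ** 2)
        gain = np.maximum(gain, bump)
    return gain

# Harmonic regeneration: multiply the noisy spectrum by the comb gain;
# the result is then mixed with the hybrid-median filtered spectrum
# (the mixing rule itself is given by the paper's equation).
```

The comb emphasises the estimated harmonic tracks of the noisy spectrum while attenuating the inter-harmonic regions where noise dominates.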
5. Experimental Results
In the experiments, the speech signals were Mandarin Chinese utterances spoken by a female speaker. The noisy speech was obtained by corrupting clean speech with white, F16-cockpit, factory, helicopter-cockpit, and car noise signals extracted from the Noisex-92 database. Three input SNR levels (0 dB, 5 dB, and 10 dB) were used to evaluate the performance of the speech enhancement system. The Virag method [1] and the hybrid-median filter [4] were implemented for comparison.
5.1 Noise Estimation
A noise estimator plays a major role in determining the quality of a speech enhancement system. If the noise estimate is too low, residual noise increases. Conversely, if the noise estimate is too high, the enhanced speech sounds muffled and intelligibility is lost. Traditional voice-activity detectors (VADs) are difficult to tune under non-stationary noise corruption. In addition, applying a VAD to low-SNR speech often results in clipped speech. Thus, a VAD cannot estimate the noise level well in non-stationary and low-SNR environments.
Martin [7] proposed the minimum statistics algorithm to estimate the noise power in each subband. The algorithm does not use a VAD; instead, it tracks the minimum power in each subband to decide the noise estimate. The minimum statistics noise tracking method is based on the observation that, even during speech activity, a short-term power density estimate of the noisy signal frequently decays to values which are representative of the noise level. This method rests on the assumption that during a speech pause, or within brief periods between words and syllables, the speech energy is close or identical to zero. Thus, by tracking the minimum power within a finite window large enough to bridge high-power speech segments, the noise floor can be estimated. The detailed procedure of the minimum statistics noise estimation algorithm can be found in [7].
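The tracking idea can be sketched as follows: smooth the noisy power in each subband, then take the minimum over a sliding window long enough to bridge speech segments. Martin's full algorithm [7] additionally uses optimal time-varying smoothing and a bias compensation, both omitted here; the window length and smoothing constant are illustrative.

```python
import numpy as np

def min_statistics_noise(power, win=50, alpha=0.9):
    """Simplified minimum-statistics noise tracker. power is a
    (frames x subbands) array of noisy power values; the estimate in
    each subband is the minimum of the recursively smoothed power over
    the last `win` frames."""
    n_frames, n_bins = power.shape
    smoothed = np.empty_like(power)
    smoothed[0] = power[0]
    for m in range(1, n_frames):
        # first-order recursive smoothing per subband
        smoothed[m] = alpha * smoothed[m - 1] + (1 - alpha) * power[m]
    noise = np.empty_like(power)
    for m in range(n_frames):
        lo = max(0, m - win + 1)
        noise[m] = smoothed[lo:m + 1].min(axis=0)  # sliding minimum
    return noise
```

Because speech bursts shorter than the window cannot pull the minimum up, the estimate follows the noise floor rather than the speech energy.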
5.2 Segmental SNR Improvement
The quantities of noise reduction, residual noise, and speech distortion can be measured by the average segmental SNR improvement (Avg_SegSNR_Imp). The average segmental SNR (Avg_SegSNR) of a test signal is evaluated with respect to the clean speech s(m, n) and the enhanced signal. It can be expressed by
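The average segmental SNR computation can be sketched as follows; clamping each frame's SNR to [-10, 35] dB is a common convention assumed here, not a detail taken from the paper.

```python
import numpy as np

def avg_seg_snr(clean, enhanced, frame_len=256):
    """Average segmental SNR in dB over whole frames of the signals.
    Per-frame SNRs are clamped to [-10, 35] dB (assumed convention)."""
    n_frames = len(clean) // frame_len
    snrs = []
    for m in range(n_frames):
        s = clean[m * frame_len:(m + 1) * frame_len]
        e = enhanced[m * frame_len:(m + 1) * frame_len]
        noise_energy = np.sum((s - e) ** 2) + 1e-12  # guard against zero
        snr = 10 * np.log10(np.sum(s ** 2) / noise_energy + 1e-12)
        snrs.append(np.clip(snr, -10.0, 35.0))
    return float(np.mean(snrs))

# Avg_SegSNR_Imp = avg_seg_snr(clean, enhanced) - avg_seg_snr(clean, noisy)
```

The improvement is the difference between the Avg_SegSNR of the enhanced speech and that of the noisy speech, both measured against the same clean reference.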
The Avg_SegSNR_Imp is computed by subtracting the Avg_SegSNR of noisy speech from that of enhanced speech. Table 1 presents the performance comparisons in terms of the average segmental SNR improvement evaluated in speech-activity regions. The larger the value of the Avg_SegSNR_Imp is, the better the quality of enhanced speech is. Cascading the hybrid-median filter after the Virag method (Virag+Hybrid) can improve the performance of the Virag method. The major reason is attributed to the reduction of musical residual noise; meanwhile, the speech components are not seriously deteriorated. Additionally, incorporating harmonic regeneration with them (Virag+Hybrid+Harmonic), the quality of hybrid-median filtered speech (Virag+Hybrid) can be further improved.
Table 1: Comparison of SegSNR improvement for the enhanced speech in various noises
5.3 Perceptual Evaluation of Speech Quality
The perceptual evaluation of speech quality (PESQ) measure, which has a better correlation with subjective tests than the other objective measures, was selected as the ITU-T recommendation P.862 to evaluate the speech quality of a test signal. In the computation of PESQ score for an enhanced speech signal (or a noisy speech signal), the clean and enhanced speech signals were initially level-equalized to a standard listening level, and then passed through a filter with a response similar to a standard telephone handset. The clean and enhanced speech signals were aligned in the time domain to correct for the time delays between these two signals. Hence, these two signals were processed through an auditory transform, similar to that of perceptual speech quality measure (PSQM) to obtain the loudness spectra. The disturbance, obtained by computing the difference between the loudness spectra for the clean and the enhanced speech signals, was computed and averaged over time and frequency to produce the prediction of subjective mean opinion score (MOS). The maximal PESQ score corresponds to the best speech quality.
Table 2 presents the performance comparisons in terms of the perceptual evaluation of speech quality (PESQ). We can find that incorporating harmonic regeneration with the hybrid-median filter (Virag+Hybrid+Harmonic) obtains a higher PESQ score than the version without harmonic regeneration. This shows that the proposed post-processing method can improve the speech quality of vowels while suppressing the musical residual noise.
Table 2: Comparisons of perceptual evaluation of speech quality (PESQ) for the enhanced speech in various noises
5.4 Waveform Plot and Speech Spectrograms
Fig. 3 demonstrates an example of speech waveform plots. Comparing the waveform plots of the enhanced speech shown in Fig. 3(c) to Fig. 3(e), the proposed method (Fig. 3(e)) and the hybrid-median filter method (Fig. 3(d)) outperform the Virag method in the removal of background noise, in particular during speech-pause regions. The major reason is that the hybrid-median filter can efficiently remove a quantity of musical residual noise. In speech-activity regions, the proposed method is better able to preserve speech components, so the quality of the post-processed speech is improved.
Fig. 3. Example of speech signal spoken in Mandarin by a female speaker (from top to bottom): (a) clean speech, (b) noisy speech corrupted by white noise with average SegSNR=0 dB, (c) enhanced speech using Virag method, (d) enhanced speech using Virag with hybrid-median filtering, and (e) enhanced speech using the proposed method.
Fig. 4 shows the spectrograms of a speech signal. Observing the spectrograms of the enhanced speech during speech-pause regions shown in Fig. 4(c) to Fig. 4(e), the proposed method and the hybrid-median filter outperform the Virag method in the removal of background noise, leaving less residual noise contamination than the Virag method. In addition, the proposed method can regenerate deteriorated harmonics, improving the quality of the post-processed speech for vowels.
Fig. 4. Spectrograms of speech spoken by a female speaker: (a) clean speech, (b) noisy speech corrupted by white noise with average SegSNR=5 dB , (c) enhanced speech using Virag method, (d) enhanced speech using Virag method with hybrid-median filtering, and (e) enhanced speech using proposed method.
6. Conclusions
A hybrid-median filter is applied to post-process the enhanced speech, efficiently removing a quantity of musical residual noise. In addition, deteriorated harmonic peaks are regenerated to improve the quality of the post-processed speech for vowels. Experimental results show that the proposed post-processor can efficiently reduce the musical effect of residual noise and adequately reconstruct deteriorated harmonic spectra for a speech enhancement system, resulting in post-processed speech that sounds less annoying and more comfortable.
[1] N. Virag, “Single channel speech enhancement based on masking properties of the human auditory system,” IEEE Trans. on Speech and Audio Processing, vol. 7, no. 2, pp. 126-137, Mar. 1999.
[2] C.-T. Lu, “Enhancement of single channel speech using perceptual-decision-directed approach,” Speech Communication, vol. 53, no. 4, pp. 495-507, Apr. 2011.
[3] S. Jo and C. D. Yoo, “Psychoacoustically constrained and distortion minimized speech enhancement,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2099-2110, Nov. 2010.
[4] C.-T. Lu, K.-F. Tseng, and C.-T. Chen, “Reduction of musical residual noise using hybrid-mean filter,” Int. Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 6, no. 4, pp. 165-178, Aug. 2013.
[5] T. Esch and P. Vary, “Efficient musical noise suppression for speech enhancement systems,” in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Taipei, 2009, pp. 4409-4412.
[6] C.-T. Lu, “Noise reduction using three-step gain factor and iterative-directional-median filter,” Applied Acoustics, vol. 76, no. 2, pp. 249-261, Feb. 2014.
[7] R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Trans. on Speech and Audio Processing, vol. 9, no. 5, pp. 504-512, Jul. 2001.
[8] I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for robust speech enhancement,” IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12-15, Jan. 2002.
[9] T. Shimamura and H. Kobayashi, “Weighted autocorrelation for pitch extraction of noisy speech,” IEEE Trans. on Speech and Audio Processing, vol. 9, no. 7, pp. 727-730, Oct. 2001.
[10] W. Jin, X. Liu, M. S. Scordilis, and L. Han, “Speech enhancement using harmonic emphasis and adaptive comb filtering,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 356-368, Feb. 2010.
Ching-Ta Lu was born in Taiwan in 1969. He received the B.S. and M.S. degrees in electronic engineering from National Taiwan University of Science and Technology, Taipei in 1991 and 1995, respectively, and the Ph.D. degree in electrical engineering from National Tsing Hua University, Hsinchu in 2006. He was with the Department of Electronic Engineering, Asia-Pacific Institute of Creativity, Miaoli (Aug. 1995-Feb. 2008), and was the chair of the department in 2000 and 2006. Dr. Lu received excellent teaching awards in 2006 and 2007, and best tutor awards in 1986, 1987, 2005, 2006, and 2007 from the Asia-Pacific Institute of Creativity. He also received best tutor awards in 2009 and 2012 from Asia University. Currently, he is an associate professor with the Department of Information Communication, Asia University, Taichung (since Feb. 2008), and also with the Department of Biomedical Imaging and Radiological Science, China Medical University, Taichung (since Aug. 2014). His current research interests include speech enhancement, image denoising, speech coding, and speech signal processing.
Jun-Hong Shen was born in Taiwan in 1976. He received the B.S. degree in information engineering from I-Shou University, Kaohsiung in 1999, and the M.S. and Ph.D. degrees in computer science and engineering from National Sun Yat-Sen University, Kaohsiung in 2001 and 2008, respectively. Since August 2009, he has been an assistant professor with the Department of Information Communication, Asia University, Taichung. His research interests include mobile information systems, data mining, web-based information systems, and signal processing.
Kun-Fu Tseng was born in Taiwan in 1962. He received the Ph.D. degree in electronic engineering from National Defense University, Taipei in 1997. He then joined the faculty of the Department of Electronic Engineering at the Asia-Pacific Institute of Creativity, Miaoli, which was renamed the Department of Multimedia and Game Science. His research areas focus on speech enhancement and on high-frequency analysis and simulation of IC packages.
Chih-Tsung Chen was born in Taiwan in 1964. He received the B.S. degree in electronic engineering from National Taiwan University of Science and Technology, Taipei in 1989, and the M.S. degree in electrical engineering from National Sun Yat-Sen University, Kaohsiung in 1993. Currently, he is a senior lecturer with the Department of Multimedia and Game Science, Asia-Pacific Institute of Creativity, Miaoli (since Aug. 1993). His current research interests include speech enhancement, image denoising, game design, and Android applications.
Manuscript received December 5, 2013; revised March 1, 2014. This work was supported by the NSC under Grant No. NSC 102-2221-E-468-004.
C.-T. Lu is with the Department of Information Communication, Asia University, Taichung 41354, and also with the Department of Biomedical Imaging and Radiological Science, China Medical University, Taichung 40402 (Corresponding author e-mail: Lucas1@ms26.hinet.net)
J.-H. Shen is with the Department of Information Communication, Asia University, Taichung 41354 (e-mail: shenjh@asia.edu.tw).
K.-F. Tseng and C.-T Chen are with the Department of Multimedia and Game Science, Asia-Pacific Institute of Creativity, Miaoli 351 (e-mail: kftseng@ms.apic.edu.tw; ctcn@ms.apic.edu.tw).
Digital Object Identifier: 10.3969/j.issn.1674-862X.2014.03.010
Journal of Electronic Science and Technology, 2014, Issue 3