Relationship between objective and subjective evaluation of heavily-distorted speech signals
Speech applications on smartphones and smart speakers are widespread in daily life. Most of them assume distortion-free speech as input, so they do not work well in the real world, where speech signals are distorted by acoustic interference such as background noise and room reverberation. Noise reduction and dereverberation are therefore indispensable for practical speech applications; they aim to decrease the speech distortion caused by additive and convolutive acoustic interference, respectively. These goals can be pursued with a wide variety of linear and nonlinear signal processing techniques. In general, nonlinear methods enhance speech effectively and efficiently but introduce annoying nonlinear distortion into the output target signal. Perceptually oriented, low-distortion nonlinear speech enhancement is one of the recent trends in acoustic signal processing. The evaluation of speech distortion has also been an important issue for several decades, and the relationship between objective and subjective evaluation of speech distortion has been investigated under controlled experimental conditions. The effect of the speech distortion caused by highly nonlinear signal processing is, however, not well understood under realistic, complex, heavily distorted acoustic conditions. In this poster, the relationship between objective and subjective evaluation is discussed for both linear and nonlinear speech distortion under severe acoustic conditions. Noisy speech samples were prepared with stationary and non-stationary noise in very noisy conditions, where the signal-to-noise ratio (SNR) was below 0 dB. The noisy speech samples were then enhanced by linear and by nonlinear signal processing. A listening test was carried out to quantify the subjective impression of speech distortion using a five-point Mean Opinion Score (MOS) scale.
The relationship between the objective scores, calculated with several speech distortion measures, and the subjective MOS is summarized from the viewpoints of the type of signal processing, the temporal characteristics of the noise, and the SNR.
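
The quantities named in this abstract can be illustrated with a minimal Python sketch. It is not the procedure used for the poster: the abstract does not specify the enhancement methods, the distortion measures, or the statistic used to relate objective scores to MOS. The sketch assumes that noise is scaled to reach a target SNR before mixing, that MOS is the plain average of listener ratings on a 1 to 5 scale, and that a Pearson correlation coefficient is one possible way to relate the two kinds of scores. All function names and the synthetic signals are hypothetical.

```python
import math
import random


def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so the speech-to-noise power ratio equals
    target_snr_db, then add it to the speech (additive interference)."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (target_snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB, treating (noisy - speech) as the noise component."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))


def mean_opinion_score(ratings):
    """MOS as the arithmetic mean of ratings on a 1..5 scale (assumption)."""
    return sum(ratings) / len(ratings)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Synthetic stand-ins for real recordings (hypothetical, for illustration only):
# a 220 Hz tone as "speech" and white Gaussian noise as stationary noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# A condition below 0 dB SNR, as in the severe conditions described above.
noisy = mix_at_snr(speech, noise, -5.0)
```

With real data, `pearson` would take one objective score and one MOS value per test condition; splitting the conditions by processing type, noise type, or SNR before computing it would mirror the viewpoints listed above.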