Comparison of ideal mask-based enhancement methods for highly degraded speech
This paper compares the performance of three ideal mask-based enhancement methods for speech mixed with white Gaussian noise at very low signal-to-noise ratios (SNRs). Ideal masks, which require a priori knowledge of both the target signal and the masker, set the upper limit of what can be achieved when the instantaneous or ‘local’ SNR estimator is accurate and reliable. The standard ideal binary mask (IBM) is constructed by means of a binary classification of sound sources as target or interferer in the time-frequency domain. In each time-frequency bin, the local SNR is compared with a threshold referred to as the ‘Local Criterion’ (LC). When the local SNR exceeds the LC, the mask is assigned a value of one; otherwise, it is assigned a value of zero. Regions of the mixture signal in which the mask contains zeros are then removed. In this study, the IBM was compared with an alternative binary mask using gains of [0.1,1] rather than [0,1], and with an ideal ratio mask (IRM), which can take any value between 0 and 1. Speech produced by twelve speakers of British English was mixed with white Gaussian noise at SNRs between -29 and -5 dB before enhancement. The results demonstrate that IRMs can be used to obtain near-maximal speech intelligibility even at very low mixture SNRs. Raising the lower gain of the [0,1] mask to 0.1 gave some benefit for LC = 0 when the SNR was greater than or equal to -20 dB. This benefit is likely to be due to a reduction in the crudeness of the binary switching and therefore in the audibility of artefacts. The results also indicate the importance of mask density, defined as the number of ones in the mask, when mixture SNRs are low.
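As an illustration only, and not the authors' implementation, the three masks can be sketched from per-bin target and noise power. The function names, the `eps` guard against division by zero, the expression of the LC in dB, and in particular the IRM formula (target power divided by target-plus-noise power) are assumptions made for this sketch; the text states only that the IRM takes values between 0 and 1.

```python
import numpy as np


def local_snr_db(target_power, noise_power, eps=1e-12):
    """Local SNR in dB for each time-frequency bin (eps avoids log of zero)."""
    return 10.0 * np.log10((target_power + eps) / (noise_power + eps))


def binary_mask(target_power, noise_power, lc_db=0.0, floor=0.0):
    """Binary mask: gain 1 where the local SNR exceeds the LC, else `floor`.

    floor=0.0 gives the standard [0,1] IBM; floor=0.1 gives the [0.1,1] variant.
    """
    snr = local_snr_db(target_power, noise_power)
    return np.where(snr > lc_db, 1.0, floor)


def ratio_mask(target_power, noise_power, eps=1e-12):
    """Assumed IRM form: target power over total power, bounded by 0 and 1."""
    return target_power / (target_power + noise_power + eps)


def mask_density(mask):
    """Mask density as defined in the text: the number of ones in the mask."""
    return int(np.sum(mask == 1.0))


# Toy 2x2 grid of time-frequency powers (hypothetical values).
target_power = np.array([[4.0, 0.1], [0.5, 9.0]])
noise_power = np.ones((2, 2))

ibm = binary_mask(target_power, noise_power, lc_db=0.0)               # gains in {0, 1}
floored = binary_mask(target_power, noise_power, lc_db=0.0, floor=0.1)  # gains in {0.1, 1}
irm = ratio_mask(target_power, noise_power)                           # gains in [0, 1]

print(ibm)
print(floored)
print(irm)
print(mask_density(ibm))
```

In each case the enhanced signal would be obtained by multiplying the mask element-wise with the time-frequency representation of the mixture before resynthesis. The floored variant attenuates noise-dominated bins rather than removing them.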