Fast speech intelligibility estimation using a neural network trained via distillation
Objective measures of speech intelligibility have many uses, including the evaluation of degradation during transmission and the development of processing algorithms. One intrusive approach is based on the audibility of speech glimpses. The binaural version of the glimpse method can provide more robust performance than other state-of-the-art binaural metrics. However, the glimpse method is relatively slow to evaluate, which limits its use to non-real-time applications. We explored the use of machine learning to allow the glimpse-based metric to be estimated quickly.
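For orientation, a minimal sketch of the glimpse criterion is given below, assuming separate access to the clean speech and masker signals. A plain STFT and a nominal 3 dB local-SNR threshold stand in for the auditory filterbank and parameters of the published model, so the function name and values are illustrative rather than the implementation evaluated here.

    import numpy as np
    from scipy.signal import stft

    def glimpse_proportion(speech, masker, fs, threshold_db=3.0):
        # Fraction of time-frequency cells where the local SNR exceeds a
        # threshold, i.e. where the speech is "glimpsed". A plain STFT
        # stands in for the auditory filterbank of the published model,
        # and the 3 dB threshold is a nominal choice.
        _, _, S = stft(speech, fs=fs, nperseg=512)
        _, _, N = stft(masker, fs=fs, nperseg=512)
        eps = 1e-12  # avoid log of zero in silent cells
        local_snr_db = 10.0 * np.log10((np.abs(S) ** 2 + eps)
                                       / (np.abs(N) ** 2 + eps))
        return float(np.mean(local_snr_db > threshold_db))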
Distillation is an established machine learning approach in which a complex model is used to derive a simpler machine-learnt model capable of real-time operation. The simpler student model is trained on synthetic data generated from the complex teacher model, thereby distilling knowledge from teacher to student. In this case the teacher is the slow glimpse-based model, and the student is an artificial neural network. Once trained, the student rapidly estimates the glimpse-based speech intelligibility metric, fast enough to allow real-time operation as an intelligibility meter in a Digital Audio Workstation.
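As a concrete illustration, the sketch below shows the regression form of distillation under stated assumptions: the teacher's scores are taken to have been computed offline for every training clip (the teacher is too slow to call inside the loop), and the hidden-layer width, learning rate, epoch count and sigmoid output are illustrative choices, not the settings used in this work.

    import torch
    import torch.nn as nn

    def distil_student(features, teacher_scores, n_epochs=200, lr=1e-3):
        # Fit a small regression network (the student) to intelligibility
        # scores produced offline by the slow glimpse-based teacher.
        # `features` is an (n_clips, n_features) array and `teacher_scores`
        # an (n_clips,) array; both are assumed precomputed, since the
        # teacher is too slow to call inside the training loop.
        x = torch.as_tensor(features, dtype=torch.float32)
        y = torch.as_tensor(teacher_scores, dtype=torch.float32).unsqueeze(1)
        student = nn.Sequential(          # shallow network, as in the text;
            nn.Linear(x.shape[1], 64),    # the width of 64 is illustrative
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),                 # intelligibility score in [0, 1]
        )
        opt = torch.optim.Adam(student.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(n_epochs):
            opt.zero_grad()
            loss = loss_fn(student(x), y)   # student matches teacher labels
            loss.backward()
            opt.step()
        return student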
A shallow artificial neural network with a relatively simple structure was found to be sufficient. The inputs to the network are cross-correlations between Mel-frequency cepstral coefficients (MFCCs) for the clean and noisy speech. Only the largest of the cross-correlation values for the left and right ear signals is used as an input, to simulate better-ear binaural listening. Even for this lightweight artificial neural network, a large amount of training data is necessary to make the distillation robust: 1,200 hours of audio were used, containing speech from a wide range of sources (the SALUC, SCRIBE and R-SPIN speech corpora and LibriVox audiobooks). Maskers included speech-shaped noise, competing speech, amplitude-modulated noise, music and sound effects. The signal-to-noise ratio ranged from -20 to 20 dB. Performance was evaluated on a test data set not used in training. A comparison between the estimated speech intelligibility and the full glimpse-based model gave an r² of 0.96 and a mean squared error of 0.003.
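The feature extraction might look roughly like the following sketch. The zero-lag normalised correlation per MFCC coefficient, the use of librosa, and n_mfcc=13 are assumptions made for illustration; the exact cross-correlation definition used in this work may differ.

    import numpy as np
    import librosa

    def mfcc_correlation(clean, noisy, fs, n_mfcc=13):
        # Normalised correlation between the clean and noisy trajectory
        # of each MFCC coefficient: one value per coefficient.
        ref = librosa.feature.mfcc(y=clean, sr=fs, n_mfcc=n_mfcc)
        deg = librosa.feature.mfcc(y=noisy, sr=fs, n_mfcc=n_mfcc)
        n = min(ref.shape[1], deg.shape[1])
        r = ref[:, :n] - ref[:, :n].mean(axis=1, keepdims=True)
        d = deg[:, :n] - deg[:, :n].mean(axis=1, keepdims=True)
        num = np.sum(r * d, axis=1)
        den = np.linalg.norm(r, axis=1) * np.linalg.norm(d, axis=1) + 1e-12
        return num / den

    def better_ear_features(clean_lr, noisy_lr, fs):
        # Element-wise maximum over the left and right ear correlations,
        # simulating better-ear binaural listening as described above.
        # clean_lr and noisy_lr are (left, right) waveform pairs.
        left = mfcc_correlation(clean_lr[0], noisy_lr[0], fs)
        right = mfcc_correlation(clean_lr[1], noisy_lr[1], fs)
        return np.maximum(left, right)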