A new set of superimposed speech features to predict a priori the performance of automatic speech recognition systems
The aim of this exploratory work is to predict, a priori, the quality of automatic transcription when speech is mixed with music. To make this prediction, we need to quantify the impact of music (treated as noise in our study) on speech before decoding by an Automatic Speech Recognition (ASR) system. Noise level estimation in a speech signal generally exploits the bimodality of the noisy speech distribution. When the noise is music, the distribution has more than two modes, which makes this kind of noise level estimation unviable.
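To illustrate this failure mode, the minimal sketch below (not the authors' method; the frame sizes, the histogram-mode heuristic, and the function names are illustrative assumptions) shows the usual bimodality trick: with stationary noise, frame log-energies split into a low noise-only mode and a high speech mode, and the lower mode estimates the noise level. Music adds further modes, so the lower-mode heuristic stops isolating the noise.

```python
import numpy as np

def frame_log_energies(x, frame_len=512, hop=256):
    """Frame-wise log-energy (dB) of a 1-D signal."""
    frames = np.lib.stride_tricks.sliding_window_view(x, frame_len)[::hop]
    return 10.0 * np.log10((frames ** 2).mean(axis=1) + 1e-12)

def noise_level_from_bimodality(x, n_bins=50):
    """Estimate the noise level as the lowest mode of the log-energy
    histogram; with music the histogram grows extra modes, and this
    heuristic no longer isolates the noise floor."""
    e = frame_log_energies(x)
    hist, edges = np.histogram(e, bins=n_bins)
    modes = [i for i in range(1, n_bins - 1)
             if hist[i] > hist[i - 1] and hist[i] >= hist[i + 1]]
    i = modes[0] if modes else int(np.argmax(hist))
    return 0.5 * (edges[i] + edges[i + 1])  # center of the lowest-energy mode
```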
We therefore propose a new set of features.
Entropy modulation (Pinquier et al., 2002, ICSLP) detects how speech-like a signal is by measuring its lack of order: music has a more orderly structure than speech. A voiced/unvoiced signal duration ratio is computed to measure f0 detection anomalies caused by background music. The dip test (Hartigan and Hartigan, 1985, The Annals of Statistics, vol. 13, pp. 70-84) measures the unimodality of the f0 distribution: when music is mixed with speech, the f0 distribution becomes less unimodal. The excitation behavior of linear prediction residuals (Ferreira et al., 2019, SPIN), originally a reverberation measure, quantifies the "strength of voicing" of a voiced signal: in the case of reverberation, the superposition between phonemes modifies this value.
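As a rough illustration, the sketch below computes three of these features. It assumes librosa for the STFT and pYIN f0 tracking and the third-party diptest package for Hartigan's dip statistic; the window sizes, f0 range, and the std-based realization of entropy modulation are illustrative choices, not necessarily the paper's settings, and the linear-prediction residual feature is omitted.

```python
import numpy as np
import librosa   # assumed available for STFT and pYIN f0 tracking
import diptest   # third-party package implementing Hartigan's dip test

def entropy_modulation(y, n_fft=512, hop=256):
    """Std of frame-wise spectral entropy: speech is less ordered than
    music, so its entropy fluctuates more (one plausible realization of
    the entropy-modulation feature)."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    p = S / (S.sum(axis=0, keepdims=True) + 1e-12)  # per-frame spectral pmf
    H = -(p * np.log2(p + 1e-12)).sum(axis=0)       # frame-wise entropy
    return float(np.std(H))

def voicing_features(y, sr):
    """Voiced/unvoiced duration ratio plus the dip statistic of the f0
    distribution; background music perturbs f0 tracking and makes the
    f0 distribution less unimodal."""
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    ratio = voiced.sum() / max((~voiced).sum(), 1)  # frame counts as duration proxy
    dip, _pval = diptest.diptest(f0[voiced])        # larger dip = less unimodal
    return float(ratio), float(dip)
```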
The experiment was conducted on the Wall Street Journal (WSJ) corpus. Six pieces of music of different styles from the RFM directory of the MUSAN corpus (Snyder et al., 2015, arXiv:1510.08484) were mixed with WSJ at three signal-to-noise ratios (5, 10, and 15 dB). The mean absolute error (MAE) of the Word Error Rate (WER) prediction is 9.17 at the utterance level. The main goal is to inform users as early as possible about the quality of the automatic transcription of their audio documents. In most cases, recordings submitted by users of transcription services exceed 3 minutes. When 20 utterances are used (around 140 s), the MAE of the WER prediction drops to 3.82. This experiment suggests that our set of features is well correlated with the recognition errors of the ASR system in this case of speech mixed with music.
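For reference, a minimal sketch of the mixing condition follows, assuming both signals are 1-D arrays at the same sample rate; the scaling factor follows directly from the SNR definition, and the function name and looping of the music track are illustrative assumptions.

```python
import numpy as np

def mix_at_snr(speech, music, snr_db):
    """Scale `music` so that 10*log10(P_speech / P_music) == snr_db,
    then add it to `speech`."""
    music = np.resize(music, speech.shape)  # loop/trim music to speech length
    p_speech = np.mean(speech ** 2)
    p_music = np.mean(music ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_music * 10 ** (snr_db / 10.0)))
    return speech + scale * music

# e.g. one mixture per SNR condition used in the experiment:
# mixtures = {snr: mix_at_snr(speech, music, snr) for snr in (5, 10, 15)}
```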