Towards microscopic intelligibility modelling
Existing intelligibility models successfully estimate word recognition rates in broadly stated noise conditions. These predictions may be characterized as macroscopic since they represent aggregates: averages over many listeners and many stimuli. Many macroscopic intelligibility models have been proposed in the past, mainly in the context of communication channel assessment. The articulation index (AI), the speech transmission index (STI) and short-time objective intelligibility (STOI) are all predictors of the intelligibility of distorted speech signals. We hypothesize that by employing data-driven techniques we can predict individual listener behaviour at a sub-lexical level. Such microscopic models should be capable of making precise predictions of what a specific listener might hear in response to a specific speech signal. Furthermore, we expect the development of such models to provide insights into, or validate our understanding of, speech-in-noise perception.
Microscopic approaches to intelligibility prediction have started to receive attention in recent years because of their potential to facilitate hearing research. We present work that lays the groundwork for the development of such microscopic intelligibility models. This overview covers studies ranging from data acquisition and the definition of evaluation metrics to the analysis of collected experimental responses and intelligibility-related automatic speech recognition (ASR) performance.
First, we limit our field of study to single-word recognition. We review the tasks and methods proposed for evaluating microscopic intelligibility models in such a setup. We present a corpus of noise-induced British English speech misperceptions, mirroring the existing consistent confusion corpus in Spanish. We then discuss the factors that influence the elicitation of such confusions, focusing on linguistic effects such as word usage frequency and on acoustic factors such as the type of masking noise. On the modelling side, we start by showing how intelligibility measures such as STOI can be better predictors of the word error rates of ASR systems than the signal-to-noise ratio (SNR). Finally, we review recent work on using ASR-style modelling to predict fine-grained speech perception responses.