Deep neural networks for predicting human auditory perception
Deep learning has given speech technology a major boost and has enabled devices with relatively high robustness in automatic speech recognition (ASR). For some use cases, the underlying algorithms have become so robust that their degradation in the presence of noise resembles human listeners' perception of noisy speech. In my talk, I will present examples of models for speech intelligibility, perceived speech quality, and subjective listening effort that are derived from deep neural networks and based on estimates of phoneme probabilities calculated from acoustic observations. In some cases, these algorithms outperform baseline models even though they operate on the mixture of noise and speech, in contrast to other approaches that often require separate noise and speech inputs. This implies that these algorithms need less a priori knowledge, which could make them interesting for hearing research, e.g., for the continuous optimization of parameters in future hearing devices. The underlying statistical models were trained on hundreds or thousands of hours of speech and are harder to analyze than many established models; yet they are not black boxes, since various methods exist to study their properties, which I will briefly outline in the talk.
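The core idea, a network that maps acoustic observations to phoneme probabilities and a perception measure derived from those probabilities, can be sketched in a few lines. Everything below is a hypothetical stand-in rather than the actual models from the talk: a single softmax layer replaces the deep network, random weights and frames replace trained parameters and real acoustic features, and a simple mean max-posterior score replaces the trained intelligibility, quality, and listening-effort measures.

```python
import numpy as np

def phoneme_posteriors(features, weights):
    """Toy stand-in for an acoustic model: one softmax layer mapping
    acoustic feature frames to phoneme posterior probabilities."""
    logits = features @ weights
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def intelligibility_proxy(posteriors):
    """Mean per-frame confidence (maximum posterior). The intuition:
    degraded input tends to flatten the phoneme posteriors, which
    lowers this score. Illustrative only, not the talk's measure."""
    return float(posteriors.max(axis=1).mean())

rng = np.random.default_rng(0)
num_feats, num_phones = 13, 40                 # e.g. MFCC dim, phoneme set size
W = rng.normal(size=(num_feats, num_phones))   # hypothetical "trained" weights
frames = rng.normal(size=(200, num_feats))     # stand-in for feature frames

posteriors = phoneme_posteriors(frames, W)     # one distribution per frame
score = intelligibility_proxy(posteriors)      # one score per utterance
```

Note that this pipeline consumes only the (possibly noisy) observed signal's features; it does not require separate clean-speech and noise inputs, which is the property highlighted in the abstract.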