Babble noise augmentation for phone recognition applied to children reading aloud in a classroom environment
Speech recognition for children still performs below the state of the art for adult speech (Shivakumar et al. 2018, arXiv:1805.03322; Hadian et al. 2018, Proc. Interspeech, 12-16). The speech of young children is particularly difficult to recognise, and corpora large enough to train acoustic models are lacking. Furthermore, for our reading assistant for 5-7-year-old children learning to read, models need to cope with slow reading rates, disfluencies, and classroom-typical babble noise.
In this work, we aim to improve a phone recognition system’s robustness to babble noise, so that it can give accurate feedback to children reading aloud despite noisy environmental conditions. We use a data augmentation method that mixes the speech recordings with babble noise recordings at target signal-to-noise ratios (SNRs) of 2, 5, 10 and 15 dB. The speech recordings are part of our in-house speech dataset, gathered directly in schools or via the Lalilo platform. The noise recordings come either from the DEMAND corpus (Thiemann et al. 2013, Zenodo.1227121), where babble noise is composed of adult voices and is constant, or from our in-house noise (IHN) corpus, which contains real-life classroom environments, where babble noise comes mostly from children and is much more irregular. The evaluation set consists of recordings of children reading isolated words in a classroom environment, with SNRs ranging from -10 to 50 dB (mean 23.8 dB, standard deviation 10.9 dB).
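The mixing step described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: the function names `mean_power` and `mix_at_snr`, the random choice of noise segment, and the looping of short noise recordings are all assumptions. The noise is scaled by a gain g chosen so that 10·log10(P_speech / (g²·P_noise)) equals the target SNR.

```python
import math
import random


def mean_power(signal):
    """Average power of a signal given as a sequence of float samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db, rng=None):
    """Mix speech with a noise segment scaled to reach a target SNR in dB.

    Hypothetical helper: returns (mixed, scaled_noise) as lists of floats.
    """
    rng = rng or random.Random(0)
    noise = list(noise)
    if len(noise) < len(speech):
        # Assumption: loop short noise recordings until they cover the speech.
        noise = noise * math.ceil(len(speech) / len(noise))
    # Assumption: pick a random noise segment of the same length as the speech.
    start = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]

    p_speech = mean_power(speech)
    p_noise = mean_power(segment)
    if p_noise == 0.0:
        raise ValueError("noise segment is silent; cannot set an SNR")

    # Solve snr_db = 10*log10(p_speech / (gain**2 * p_noise)) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in segment]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


if __name__ == "__main__":
    rng = random.Random(1)
    speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
    babble = [rng.gauss(0.0, 0.3) for _ in range(8000)]
    for target in (2, 5, 10, 15):  # target SNRs from the text
        _, scaled = mix_at_snr(speech, babble, target)
        achieved = 10 * math.log10(mean_power(speech) / mean_power(scaled))
        print(f"target {target} dB, achieved {achieved:.2f} dB")
```

Scaling the noise rather than the speech keeps the speech level unchanged across augmented copies; the opposite choice would be equally valid for reaching a given SNR.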
To build a phone recognition system, we start from a model trained on the Common Voice French adult corpus and apply transfer learning (TL) with our small child speech corpus. The TL method (Shivakumar et al. 2018, arXiv:1805.03322) takes the source adult model and re-trains it with child data, with higher learning factors for the output layers. We compare three sets of target data for transfer learning: clean, clean+DEMAND-augmented, and clean+IHN-augmented child data. We show that adapting an adult model trained on clean speech with noise-augmented child data improves the system’s overall performance on our evaluation subset. When measuring performance as a function of SNR, we observe that noise augmentation substantially reduces the error rate for very noisy recordings (SNR < 10 dB) and does not degrade performance for clean recordings. Transfer learning with babble-noise-augmented child data thus improves the child speech recognition system’s robustness to classroom-typical babble noise, a necessary quality for vocal reading assistants.
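Measuring performance as a function of SNR can be illustrated with the sketch below. It assumes the error rate is a phone error rate computed as the Levenshtein distance between reference and recognised phone sequences, pooled within SNR bins. The bin edges (10, 20, 30 dB), the function names, and the toy utterances are hypothetical and are not taken from this work.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def per_by_snr(utterances, edges=(10, 20, 30)):
    """Phone error rate per SNR bin.

    utterances: iterable of (snr_db, reference_phones, hypothesis_phones).
    edges: ascending bin boundaries in dB (assumed values, for illustration).
    Returns {bin_label: errors / reference_phones} for non-empty bins.
    """
    labels = ([f"<{edges[0]}"]
              + [f"{lo}-{hi}" for lo, hi in zip(edges, edges[1:])]
              + [f">={edges[-1]}"])
    errors = {label: 0 for label in labels}
    totals = {label: 0 for label in labels}
    for snr, ref, hyp in utterances:
        label = labels[sum(snr >= e for e in edges)]
        errors[label] += edit_distance(ref, hyp)
        totals[label] += len(ref)
    return {l: errors[l] / totals[l] for l in labels if totals[l] > 0}


if __name__ == "__main__":
    # Toy data: (SNR in dB, reference phones, recognised phones).
    toy = [
        (5, ["b", "o", "n"], ["b", "o"]),
        (8, ["l", "a"], ["l", "a"]),
        (25, ["s", "a"], ["s", "a"]),
    ]
    print(per_by_snr(toy))
```

Pooling errors and reference lengths within each bin, instead of averaging per-utterance rates, keeps short isolated words from dominating the estimate.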