Do state-of-the-art TTS synthesis systems demand high cognitive load under adverse conditions?
Text-to-speech (TTS) synthesis is increasingly being deployed in real-world applications. Yet few evaluation studies measure how listeners cope with synthetic speech produced by such systems in noisy environments.
TTS has improved significantly since the adoption of deep neural networks (DNNs). It is now possible to produce synthetic speech that is as intelligible as human speech and highly natural when heard in quiet. However, little is known about how intelligibility and naturalness are affected under adverse listening conditions, or about how synthetic speech interacts with the human cognitive processing system in noise.
In our previous work (Govender et al., 2019, Proc. 10th ISCA Speech Synthesis Workshop), we investigated the cognitive load of synthetic speech in speech-shaped noise, using pupillometry and self-reported measures. Perception studies have shown that the pupil response indexes the amount of mental effort allocated by the human cognitive processing system when performing a task (Kahneman and Beatty, 1966, Science, vol. 154, no. 3756, pp. 1583–1585; Beatty and Kahneman, 1966, Psychonomic Science, vol. 5, no. 10, pp. 371–372). In our experiments, we therefore used pupillometry to measure the pupil response of listeners hearing speech through headphones, thereby quantifying listening effort. The stimuli were synthetic speech mixed with speech-shaped noise at -3 dB and -5 dB SNR, and results were compared against those for the natural recordings of the speaker used to create the TTS voice. At both SNRs, human speech was easier to listen to than synthetic speech.
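For concreteness, mixing a stimulus with noise at a fixed SNR amounts to scaling the noise so that the speech-to-noise power ratio hits the target. The sketch below is a minimal illustration of that procedure, not the experimental code from the study; the function and variable names are placeholders.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix speech with noise at a target SNR (in dB).

    Both inputs are 1-D float arrays at the same sample rate.
    """
    # Tile/trim the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]

    # Choose a gain so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# e.g. stimulus = mix_at_snr(tts_waveform, speech_shaped_noise, -3.0)
```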
In addition, our work indicated that the increased cognitive load could be due in part to the use of a conventional statistical parametric speech synthesis (SPSS) system, which is based on the source-filter model. In such systems, spectral and excitation features are predicted in separate streams, which can destroy correlations that exist between them.
This work aims to test that hypothesis using a TTS system trained with a sequence-to-sequence DNN model with an attention mechanism, similar to Tacotron 2 (Shen et al., 2018, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing). In this way, all speech features are predicted jointly, so any correlations between them remain intact. We use a neural vocoder based on WaveRNN (Kalchbrenner et al., 2018, Proc. ICML, pp. 2415–2424), which is capable of reconstructing speech with high fidelity.
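To make the structural contrast concrete, the toy sketch below (our own illustration with placeholder linear "models" and dimensions, not the systems used in this study) shows separate-stream prediction versus a single joint acoustic target.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_LING, D_SPEC, D_EXC = 100, 30, 60, 3  # frames and feature dimensions

linguistic = rng.normal(size=(T, D_LING))  # per-frame linguistic features

# SPSS-style: two independent streams. The spectral and excitation
# predictions never interact, so spectral-excitation correlations
# present in natural speech are not modelled.
W_spec = rng.normal(size=(D_LING, D_SPEC))
W_exc = rng.normal(size=(D_LING, D_EXC))
spectral = linguistic @ W_spec    # e.g. mel-cepstra
excitation = linguistic @ W_exc   # e.g. F0, aperiodicity

# Seq2seq-style: one joint acoustic target, so all feature
# correlations can be captured by a single predictor. (Tacotron 2
# predicts a mel spectrogram autoregressively with attention; this
# toy keeps only the "single unified stream" idea.)
W_joint = rng.normal(size=(D_LING, D_SPEC + D_EXC))
acoustic = linguistic @ W_joint
```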
We expect this system to impose a lower cognitive load than the conventional statistical parametric system used previously.