Spectrotemporal prediction errors support perception of degraded speech
Speech perception depends not only on signal quality but also on supportive contextual cues or prior knowledge (Sohoglu et al., 2014, JEP:HPP 40:186). Predictive coding (PC) theories provide a common framework to explain the neural impact of these two influences on speech perception. According to PC accounts, neural representations of expected sounds are subtracted from bottom-up signals, such that only the unexpected parts (i.e. ‘prediction error’) are passed up the cortical hierarchy to update higher-level representations (Rao and Ballard, 1999, Nat. Neurosci. 2:79). Previous multivariate fMRI data (Blank and Davis, 2016, PLoS Biol. 14:e1002577) show that when listeners’ predictions are weak or absent, neural representations are enhanced for higher-fidelity speech sounds. However, when listeners make accurate predictions (e.g. after matching text), higher-fidelity speech leads to suppressed neural representations despite better perceptual outcomes. Computational simulations reported by Blank and Davis (2016) demonstrate that these observations are uniquely consistent with prediction error computations, and challenge alternative accounts in which all forms of perceptual improvement should enhance neural representations (Aitchison and Lengyel, 2017, Curr. Opin. Neurobiol. 46:219). In the current work we applied forward encoding models (Crosse et al., 2016, Front. Hum. Neurosci. 10:604) to MEG data and tested whether cross-over interactions between signal quality and prior knowledge on neural representations are evident at early stages of processing (within 200 ms of speech input).
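A forward encoding model of the kind referred to above can be sketched as a time-lagged ridge regression from stimulus features to each sensor's response, scored by how well it predicts held-out data. The following is a minimal NumPy illustration under stated assumptions, not the analysis pipeline used in this study: the function names, the lag range and the regularisation value are all hypothetical choices made for the example.

```python
import numpy as np

def lag_matrix(stim, lags):
    """Stack time-shifted copies of stim (time x features), one per lag.

    A positive lag of k samples means the response at time t is modelled
    from the stimulus at time t - k. Samples shifted in from outside the
    recording are zero-padded.
    """
    n_times, n_feats = stim.shape
    X = np.zeros((n_times, n_feats * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.zeros_like(stim)
        if lag > 0:
            shifted[lag:] = stim[:-lag]
        elif lag < 0:
            shifted[:lag] = stim[-lag:]
        else:
            shifted[:] = stim
        X[:, i * n_feats:(i + 1) * n_feats] = shifted
    return X

def fit_ridge(X, y, lam):
    """Closed-form ridge solution: w = (X'X + lam * I)^-1 X'y."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_params), X.T @ y)

def encoding_accuracy(stim_train, resp_train, stim_test, resp_test,
                      lags, lam=1.0):
    """Pearson r between predicted and observed held-out responses,
    computed separately for each response channel (column)."""
    w = fit_ridge(lag_matrix(stim_train, lags), resp_train, lam)
    pred = lag_matrix(stim_test, lags) @ w
    pred_c = pred - pred.mean(axis=0)
    resp_c = resp_test - resp_test.mean(axis=0)
    return (pred_c * resp_c).sum(axis=0) / (
        np.linalg.norm(pred_c, axis=0) * np.linalg.norm(resp_c, axis=0))
```

In this sketch, competing feature spaces (for example envelope versus spectrotemporal modulations) would be compared by passing different `stim` matrices and contrasting the resulting held-out correlations.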
We analysed data from a previous MEG study (N=21, English speakers) which measured evoked responses to degraded spoken words (Sohoglu and Davis, 2016, PNAS 113:E1747). Listeners heard noise-vocoded speech that varied in signal quality (number of spectral channels) and was preceded by matching or mismatching written text (prior knowledge). Consistent with previous findings (Sohoglu et al., 2014, JEP:HPP 40:186), ratings of speech clarity were enhanced by greater spectral detail and by matching text.
We report two main MEG findings: (1) MEG responses to speech were best predicted using spectrotemporal modulations (outperforming envelope, spectrogram and phonetic feature representations). (2) We observed a cross-over interaction between signal quality and prior knowledge, consistent with prediction error representations: when matching text preceded speech, greater spectral detail was associated with reduced forward encoding accuracy, whereas when mismatching text preceded speech, greater spectral detail was associated with increased encoding accuracy. This interaction emerged in MEG responses within 200 ms of speech input, consistent with the early computation of prediction error proposed by PC theories. These findings contribute towards the detailed specification of a computational model of speech perception based on PC principles.
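As a toy illustration of why a prediction error code can produce the cross-over pattern in finding (2), the sketch below subtracts a prediction from a scaled sensory input and measures how strongly the heard word remains represented in the residual. This is not the simulation reported by Blank and Davis (2016) and not the model fitted to the MEG data; the function name, the use of a single `quality` scaling factor between 0 and 1, and the choice of an orthogonal pattern for the mismatching prediction are all assumptions made for the example.

```python
import numpy as np

def speech_signal_in_error(quality, prediction_matches,
                           n_features=50, seed=0):
    """Strength with which the heard word is represented in the
    prediction error (absolute projection onto the heard pattern)."""
    rng = np.random.default_rng(seed)

    # Hypothetical unit-length feature pattern for the heard word.
    heard = rng.standard_normal(n_features)
    heard /= np.linalg.norm(heard)

    # Hypothetical pattern for a different word, made orthogonal to
    # the heard word so that it carries no information about it.
    other = rng.standard_normal(n_features)
    other -= (other @ heard) * heard
    other /= np.linalg.norm(other)

    # Assumption: signal quality simply scales the sensory pattern.
    sensory_input = quality * heard
    prediction = heard if prediction_matches else other

    # Prediction error: expected pattern subtracted from the input.
    error = sensory_input - prediction
    return abs(error @ heard)

for matches in (True, False):
    low = speech_signal_in_error(0.25, matches)
    high = speech_signal_in_error(0.75, matches)
    label = "matching" if matches else "mismatching"
    print(f"{label}: low quality {low:.2f}, high quality {high:.2f}")
```

Under these assumptions the residual representation of the heard word shrinks with signal quality after a matching prediction and grows with signal quality after a mismatching one, which is the direction of the interaction described above.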