Glimpses of what? Effects of varying the substrate while keeping the spectro-temporal mask constant
Speech can be generated by sampling a base signal (the substrate) at the locations defined by a subset of spectro-temporal regions (the mask). Glimpse resynthesis performed in this way can produce highly intelligible speech. To explore which properties of the underlying substrate are important for successful speech perception, the current study examined the effect of changing the substrate while keeping the mask constant.
For each of 240 sentences in the Spanish Harvard Corpus, a mask was formed from the spectro-temporal regions where the speech was more energetic than a speech-shaped noise masker when the two were mixed at 0 dB SNR. In each case, the resulting mask was applied in glimpse resynthesis to 7 distinct substrates which varied in the amount of information they contained from the original speech. Two substrates contained the complete speech signal, viz. the speech signal itself and the speech-plus-noise mixture. Two substrates contained partial speech information: either the temporal fine structure came from the mixture and the envelope from the masker, or vice versa. Three further substrates (speech-shaped noise, wideband polyphonic music, and speech from a different language, English) contained no information at all from the original speech. A total of 26 normal-hearing Spanish listeners identified 5 keywords per sentence in the 7 constant-mask conditions, and in an additional condition in which the original speech-plus-noise mixture was presented without a mask.
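The mask construction described above can be illustrated with a minimal sketch. It assumes that the speech and the masker have already been analysed into spectro-temporal energy grids (frames by frequency channels); the analysis front end and the resynthesis back to a waveform are not specified here and are omitted. The names `glimpse_mask`, `apply_mask` and `local_snr_db`, and all numeric values, are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: toy energy grids, not data from the study.
# Rows are time frames, columns are frequency channels (linear energy units).

def glimpse_mask(speech_energy, masker_energy, local_snr_db=0.0):
    """Binary mask: 1 where speech energy exceeds masker energy by local_snr_db.

    With local_snr_db = 0 this keeps the regions where the speech is simply
    more energetic than the masker (an assumed reading of the criterion).
    """
    ratio = 10 ** (local_snr_db / 10.0)  # convert the dB criterion to an energy ratio
    return [
        [1 if s > ratio * m else 0 for s, m in zip(s_row, m_row)]
        for s_row, m_row in zip(speech_energy, masker_energy)
    ]

def apply_mask(mask, substrate):
    """Sample any substrate at the mask locations; zero elsewhere."""
    return [
        [g * x for g, x in zip(g_row, x_row)]
        for g_row, x_row in zip(mask, substrate)
    ]

# Hypothetical 3-frame x 4-channel grids.
speech = [[4.0, 1.0, 0.5, 3.0],
          [0.2, 2.5, 2.0, 0.1],
          [1.5, 0.3, 0.9, 2.2]]
masker = [[1.0] * 4 for _ in range(3)]
music = [[2.0, 2.0, 2.0, 2.0],
         [1.0, 3.0, 1.0, 3.0],
         [0.5, 0.5, 0.5, 0.5]]

mask = glimpse_mask(speech, masker)      # computed once from speech and masker
glimpsed_speech = apply_mask(mask, speech)
glimpsed_music = apply_mask(mask, music)  # same mask, different substrate

# Proportion of spectro-temporal cells retained by the mask.
proportion_glimpsed = sum(map(sum, mask)) / (len(mask) * len(mask[0]))
```

The point of the sketch is that the mask depends only on the speech and the masker, so the same `mask` can be reused unchanged across every substrate. Plain lists keep the example dependency-free; an array library would be the more usual choice in practice.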
The keywords correct score for the non-masked mixture was 87%. Expressed as proportions of this score, listeners performed equivalently for masks applied to the speech signal (1.02) and the mixture (0.97). Removal of amplitude information from the mixture led to a modest loss in intelligibility (0.91), while removal of mixture fine structure produced a larger fall (0.83). However, replacing the substrate with a noise masker led to no further drop in intelligibility (0.81). Listeners were also able to identify substantial numbers of keywords in the music substrate (0.61), but performance fell to near chance when the substrate was a different-language masker (0.04). These findings reveal that the mask alone contains a remarkable amount of information to support speech perception, but that properties of the substrate can interfere with listeners' ability to treat the mask as conveying speech cues. Intriguingly, listeners reported that they were totally unaware of the nature of the musical substrate, and were generally unable to recognise any words in the different-language masker or to identify its language.
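Because the scores above are given as proportions of the 87% baseline, approximate raw keyword scores can be recovered by multiplication. The sketch below does this arithmetic; the condition labels are shorthand introduced here, and the results are only approximate because the reported proportions and the baseline are themselves rounded.

```python
# Back-computing approximate raw scores from the reported proportions.
# Labels are informal shorthand; values are those stated in the text.
BASELINE_PERCENT = 87.0  # keywords correct for the non-masked mixture

relative_scores = {
    "speech": 1.02,
    "mixture": 0.97,
    "mixture fine structure + masker envelope": 0.91,
    "masker fine structure + mixture envelope": 0.83,
    "speech-shaped noise": 0.81,
    "music": 0.61,
    "different-language speech": 0.04,
}

def to_percent(relative, baseline=BASELINE_PERCENT):
    """Convert a score expressed relative to the baseline into percent correct."""
    return relative * baseline

approx_percent = {name: round(to_percent(r), 1) for name, r in relative_scores.items()}

for name, pct in approx_percent.items():
    print(f"{name}: about {pct}% keywords correct")
```

On this reckoning the noise substrate, which carries no information from the original speech, still supports roughly 70% keywords correct, whereas the different-language substrate supports only about 3%.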