A generalised recurrent sequence to sequence model for robust and efficient speech recognition
Sequence-to-sequence neural network architectures have been applied successfully to sequence modelling tasks such as speech recognition. Two popular variants are based either on recurrent transformations with cross-modal attention or on self-attention layers. However, both share the limitation of requiring full sentences as input, which prevents online, or real-time, use of these models.
It is commonly observed that the alignment between decoded graphemes and acoustic features is loosely monotonic in such networks. This suggests that speech signals exhibit mostly short-term dependencies between linguistic and acoustic units, and that full-sentence models are computationally and mathematically inefficient. Several studies have attempted to limit the temporal context to a fixed-size window, but reported a degradation in recognition performance.
In this work we describe an end-to-end differentiable neural model that learns to cluster contiguous acoustic representations into segments. Similar to the Adaptive Computation Time (ACT) algorithm, our model is equipped with a halting unit that signals the end of an acoustic segment. Conceptually, this can be viewed as an end-to-end alternative to the segmental features originally proposed for Hidden Markov Models.
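To make the mechanism concrete, the following is a minimal PyTorch sketch of an encoder with an ACT-style halting unit. The names (SegmentalEncoderCell, halt, threshold), the choice of a GRU cell, and the hard thresholding are illustrative assumptions, not the exact model; in particular, the model described here keeps the halting decision differentiable end to end.

```python
import torch
import torch.nn as nn

class SegmentalEncoderCell(nn.Module):
    """Recurrent encoder with an ACT-style halting unit (illustrative sketch).

    At every input frame a scalar halting probability is computed; when it
    exceeds a threshold, the current hidden state is emitted as a segment
    representation and the recurrence is reset for the next segment.
    """

    def __init__(self, input_size, hidden_size, threshold=0.5):
        super().__init__()
        self.rnn_cell = nn.GRUCell(input_size, hidden_size)
        self.halt = nn.Linear(hidden_size, 1)  # hypothetical halting unit
        self.threshold = threshold
        self.hidden_size = hidden_size

    def forward(self, frames):
        # frames: (T, input_size), one utterance processed frame by frame.
        h = frames.new_zeros(1, self.hidden_size)
        segments = []
        for x in frames:
            h = self.rnn_cell(x.unsqueeze(0), h)
            p_halt = torch.sigmoid(self.halt(h))
            if p_halt.item() > self.threshold:
                # Hard decision for readability only; the actual model is
                # differentiable. A trailing partial segment is dropped.
                segments.append(h.squeeze(0))
                h = frames.new_zeros(1, self.hidden_size)
        if not segments:  # no boundary fired: treat the utterance as one segment
            segments.append(h.squeeze(0))
        return torch.stack(segments)  # (num_segments, hidden_size)
```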
We show that our model is a generalisation of the traditional sequence-to-sequence model. The low-overhead halting unit can make the encoding process stop after every input timestep, leading to an encoder representation identical to that of the traditional model. This is the default behaviour in the absence of any segmentation incentive, when only the cross-entropy between predictions and targets is optimised.
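Under the assumptions of the sketch above, this reduction to the traditional model can be checked by forcing the halting unit to fire at every frame, in which case each frame becomes its own segment and the output coincides with the frame-level states of a conventional encoder:

```python
import torch

enc = SegmentalEncoderCell(input_size=40, hidden_size=64)
with torch.no_grad():
    enc.halt.weight.zero_()    # force p_halt = sigmoid(10) ~= 1.0,
    enc.halt.bias.fill_(10.0)  # so a boundary fires at every frame

frames = torch.randn(100, 40)  # e.g. 100 frames of 40-dim filterbanks
segments = enc(frames)
assert segments.shape == (100, 64)  # one "segment" per input frame
```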
We explore a set of strategies that encourage the encoder to aggregate multiple timesteps while maintaining decoding accuracy. We also identify common patterns in segmental architectures that inhibit segment discovery by design, and show their connection with ACT and related approaches in speech and text processing. In some cases, the reduction in the number of timesteps is 50% or more.
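One plausible form of segmentation incentive, analogous to the ponder cost in ACT, is a penalty on the expected number of emitted segments added to the cross-entropy. This is an illustrative assumption rather than the exact objective used here:

```python
import torch.nn.functional as F

def training_loss(log_probs, targets, halt_probs, penalty_weight=0.01):
    """Cross-entropy plus a ponder-style compression penalty (hypothetical).

    log_probs:  (T_out, vocab) decoder log-probabilities
    targets:    (T_out,) grapheme indices
    halt_probs: (T_in,) per-frame halting probabilities from the encoder
    """
    ce = F.nll_loss(log_probs, targets)
    # Differentiable surrogate for the number of emitted segments: fewer
    # expected boundaries means more aggregation of timesteps.
    expected_segments = halt_probs.sum()
    return ce + penalty_weight * expected_segments
```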
Automatic segmentation has the potential to replace pyramidal encoding strategies that empirically downsample the input by a constant factor at each layer of the stack (a sketch of this baseline is given below). The advantages of segmental representations are their potentially higher correlation with acoustic units and the interpretability offered by the discovered segment boundaries. Our ongoing work focuses on the analysis of the discovered segments and on integration with a segmental decoder to enable online recognition. We conclude that, when designed appropriately, neural networks can learn to cluster acoustic representations in an unsupervised way.
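For contrast, the pyramidal baseline mentioned above can be sketched as follows: each layer halves the frame rate by concatenating adjacent frame pairs, in the style of Listen, Attend and Spell encoders (exact details vary across systems):

```python
import torch

def pyramid_step(x):
    """x: (T, d) frames -> (T // 2, 2 * d) by concatenating adjacent pairs."""
    T, d = x.shape
    T = T - (T % 2)  # drop a trailing odd frame
    return x[:T].reshape(T // 2, 2 * d)

frames = torch.randn(101, 64)
assert pyramid_step(frames).shape == (50, 128)  # fixed 2x rate reduction
```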