Perceptually trained end-to-end FFTNet neural model for single channel speech enhancement
Single channel speech enhancement is a challenging task. Recent advances in machine learning (ML) show that a combination of linear and non-linear operators can model the complex characteristics of signals, including those of speech and noise. However, the modeling power of an ML model depends strongly on the layer-wise design of its architecture and on the loss function used to optimize its parameters. Although various neural approaches have been proposed to suppress noise artifacts in speech enhancement, few have exploited the statistical differences between the speech and noise signals in a mixture. Furthermore, most existing models were optimized in the waveform domain of speech, ignoring the frequency-selective nature of human auditory perception.
To address these limitations, we recently proposed a parallel, non-causal, waveform-domain, end-to-end FFTNet neural architecture. In this work, we propose an extension of the FFTNet model that is optimized in the perceptually salient spectral domain of the enhanced signal for single channel speech enhancement. The proposed model has a dilation pattern that resembles the classical butterfly structure used to compute FFT coefficients. In contrast to other waveform-based approaches such as WaveNet, FFTNet starts with a wide dilation pattern. Such an architecture better represents the long-term correlated structure of speech in the time domain, whereas noise is usually largely uncorrelated across such a wide dilation pattern. To further strengthen this property of FFTNet, we use a non-causal FFTNet architecture, in which the present sample in each layer is estimated from the past and future samples of the previous layer. By optimizing the parameters on a spectral-domain objective, the proposed model can better learn perceptually significant features than it can with time-domain training. We refer to this model as SE-FFTNet.
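The wide-to-narrow dilation pattern and the non-causal layer described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`fftnet_dilations`, `noncausal_layer`), the two-tap form y[t] = relu(w_past * x[t - d] + w_future * x[t + d]), the scalar weights, and the zero padding at the signal edges are all assumptions made for illustration only.

```python
import numpy as np

def fftnet_dilations(num_layers):
    """Assumed butterfly-like schedule: widest dilation first, e.g. [4, 2, 1]."""
    return [2 ** k for k in range(num_layers - 1, -1, -1)]

def noncausal_layer(x, w_past, w_future, dilation):
    """Hypothetical non-causal layer with scalar weights.

    Computes y[t] = relu(w_past * x[t - d] + w_future * x[t + d]),
    treating samples outside the signal as zero.
    """
    padded = np.pad(x, dilation)          # zeros on both sides
    past = padded[: len(x)]               # x[t - d]
    future = padded[2 * dilation:]        # x[t + d]
    return np.maximum(0.0, w_past * past + w_future * future)

def run_stack(x, num_layers, w_past=1.0, w_future=1.0):
    """Apply the layers in wide-to-narrow order."""
    y = x
    for d in fftnet_dilations(num_layers):
        y = noncausal_layer(y, w_past, w_future, d)
    return y

# Feed a unit impulse through three layers to see which output
# positions depend on the single non-zero input sample.
impulse = np.zeros(33)
impulse[16] = 1.0
response = run_stack(impulse, num_layers=3)
print(np.flatnonzero(response) - 16)
```

With three layers the impulse spreads symmetrically into the past and the future, up to 4 + 2 + 1 = 7 samples on each side, which is the sense in which each output sample draws on both directions. A WaveNet-style stack would use the reverse order of dilations (1, 2, 4, ...).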
The proposed SE-FFTNet model shows considerable improvement over existing models such as WaveNet and SEGAN while having far fewer parameters. Perceptually, it improves PESQ by up to 14% over WaveNet and SEGAN. In terms of reconstructed spectral components, the log-spectral distortion (LSD) is reduced by 7.6% relative to SEGAN and by 16.1% relative to WaveNet. Informal listening tests and objective metrics confirm that the proposed model optimized on the spectral objective produces better enhancement than the same model trained by minimizing a sample-domain objective function (improving PESQ by 10% and LSD by 6.9%). Formal listening tests are ongoing.
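For readers unfamiliar with the LSD metric, the following sketch shows one common way to compute it. Several definitions exist, and the abstract does not state which one was used, so the framing parameters (`frame_len`, `hop`), the Hann window, the `eps` floor, and the function names here are assumptions rather than the evaluation setup behind the reported numbers.

```python
import numpy as np

def stft_power(x, frame_len=512, hop=256):
    """Power spectrogram from Hann-windowed frames (assumed framing)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop: i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def log_spectral_distortion(clean, enhanced, eps=1e-10, **stft_kwargs):
    """Mean over frames of the RMS log-power difference in dB (one common variant)."""
    p_clean = stft_power(clean, **stft_kwargs)
    p_enh = stft_power(enhanced, **stft_kwargs)
    diff_db = 10.0 * np.log10((p_clean + eps) / (p_enh + eps))
    return float(np.mean(np.sqrt(np.mean(diff_db ** 2, axis=1))))

# Toy check on synthetic data (not speech): halving the amplitude
# changes every log-power bin by the same amount.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
print(log_spectral_distortion(signal, signal))
print(log_spectral_distortion(signal, 0.5 * signal))
```

Identical signals give an LSD of 0 dB, and halving the amplitude gives about 6.02 dB (20 log10 2), since every bin's power drops by a factor of four. Lower values mean the enhanced spectrum is closer to the clean one, so the percentage reductions quoted above are improvements.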