Fully convolutional Wasserstein autoencoder for speech enhancement
Speech enhancement methods traditionally operate in the time-frequency domain and/or exploit higher-level features such as Mel-Frequency Cepstral Coefficients. By construction, these approaches discard the phase and therefore use the input data only partially. Reconstructing the signal then requires estimating the phase, which is a difficult problem in its own right. To overcome this limitation, recent speech enhancement techniques operate directly in the time domain. These deep-learning approaches map the raw input waveform to the enhanced speech signal. The success of WaveNet in speech synthesis quickly motivated the use of generative models for speech enhancement. Similarly, Generative Adversarial Networks (GANs) have shown good denoising performance, effectively suppressing additive noise in raw-waveform speech signals.
In order to benefit from the powerful acoustic modeling capabilities of recent generative approaches, we propose to use a Wasserstein Auto-Encoder (WAE) to build a generative model of clean speech. More specifically, the model uses a fully convolutional encoder and decoder architecture with skip connections. The proposed generative model does not suffer from instability during training.
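The abstract does not spell out the WAE training objective. As an illustration only, the following is a minimal NumPy sketch of the maximum mean discrepancy (MMD) regularizer commonly used in WAEs, which penalizes the mismatch between the encoded latent codes and samples drawn from the prior; the choice of an RBF kernel, the bandwidth `sigma`, and all function names here are assumptions, not details from the paper.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Pairwise RBF kernel values between the rows of x and y.
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2.0 * x @ y.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd(qz, pz, sigma=1.0):
    # Squared-MMD estimate between encoded codes qz (aggregate posterior
    # samples) and prior samples pz; diagonal terms are dropped so the
    # within-sample averages are unbiased.
    n, m = len(qz), len(pz)
    k_qq = rbf_kernel(qz, qz, sigma)
    k_pp = rbf_kernel(pz, pz, sigma)
    k_qp = rbf_kernel(qz, pz, sigma)
    return ((k_qq.sum() - np.trace(k_qq)) / (n * (n - 1))
            + (k_pp.sum() - np.trace(k_pp)) / (m * (m - 1))
            - 2.0 * k_qp.mean())

# In a WAE the full objective would be a reconstruction term plus
# lambda * mmd(encoder(x), prior_samples); only the regularizer is shown.
rng = np.random.default_rng(0)
codes_matched = rng.normal(size=(200, 8))   # stand-in for codes near the prior
codes_shifted = rng.normal(loc=3.0, size=(200, 8))  # codes far from the prior
prior = rng.normal(size=(200, 8))
print(mmd(codes_matched, prior) < mmd(codes_shifted, prior))
```

When the encoded codes match the standard-normal prior the estimate is close to zero, and it grows as the aggregate posterior drifts away, which is what drives the encoder toward the prior during training.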
We apply our network to the denoising problem, removing additive noise from the target signals. Our model is trained on pairs of noisy and clean audio examples and, at test time, predicts a clean signal from a noisy one. Our approach outperforms other speech enhancement methods and demonstrates the effectiveness of convolutional autoencoder architectures for audio generation.