An Exploration of Mimic Architectures for Residual Network Based Spectral Mapping
Peter Plantinga, Deblin Bagchi, Eric Fosler-Lussier

TL;DR
This paper investigates the use of residual networks and long-term context integration in spectral mapping to improve speech enhancement, achieving state-of-the-art results in speech recognition accuracy.
Contribution
It introduces residual network architectures and wide-residual biLSTM models for spectral mapping, enhancing speech cleaning performance over traditional DNN approaches.
Findings
Residual networks outperform DNNs in spectral mapping.
Long-term context integration improves speech enhancement.
Achieved the lowest WER of 9.3% on the CHiME-2 dataset.
Abstract
Spectral mapping uses a deep neural network (DNN) to map directly from noisy speech to clean speech. Our previous study found that the performance of spectral mapping improves greatly when using helpful cues from an acoustic model trained on clean speech. The mapper network learns to mimic the input favored by the spectral classifier and cleans the features accordingly. In this study, we explore two innovations. First, we replace a DNN-based spectral mapper with a residual network that is more attuned to the goal of predicting clean speech. Second, we examine how integrating long-term context in the mimic criterion (via wide-residual biLSTM networks) affects the performance of spectral mapping compared to DNNs. Our goal is to derive a model that can be used as a preprocessor for any recognition system; the features derived from our model are passed through the standard Kaldi ASR pipeline and…
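The two ideas in the abstract, a residual mapper and a mimic criterion, can be illustrated with a minimal sketch. It is not the authors' implementation: the exact loss, its weighting, and the network architectures are not given in this excerpt. The sketch assumes that the mimic term is a mean squared error between the acoustic model's outputs on denoised and clean features, added to a plain fidelity term. All names (`residual_block`, `joint_loss`, `mimic_weight`, and the toy models) are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Skip connection: the block outputs its input plus a correction."""
    correction = transform(x)
    return [xi + ci for xi, ci in zip(x, correction)]


def joint_loss(
    noisy: Vector,
    clean: Vector,
    mapper: Callable[[Vector], Vector],
    acoustic_model: Callable[[Vector], Vector],
    mimic_weight: float = 1.0,  # assumed weighting, not taken from the paper
) -> float:
    """Fidelity loss plus an assumed form of the mimic loss."""
    denoised = mapper(noisy)
    # Fidelity: denoised features should match the clean features.
    fidelity = mse(denoised, clean)
    # Mimic: a model trained on clean speech should respond to the
    # denoised features the same way it responds to the clean ones.
    mimic = mse(acoustic_model(denoised), acoustic_model(clean))
    return fidelity + mimic_weight * mimic


# Toy stand-ins (hypothetical): real systems would use trained networks.
def toy_acoustic_model(v: Vector) -> Vector:
    return [sum(v), v[0] - v[-1]]


def identity_mapper(v: Vector) -> Vector:
    return list(v)


def perfect_mapper(v: Vector) -> Vector:
    # A residual mapper whose correction removes a constant 0.5 offset.
    return residual_block(v, lambda u: [-0.5 for _ in u])


noisy = [1.5, 2.5, 3.5]
clean = [1.0, 2.0, 3.0]

loss_identity = joint_loss(noisy, clean, identity_mapper, toy_acoustic_model)
loss_perfect = joint_loss(noisy, clean, perfect_mapper, toy_acoustic_model)
```

In this toy setup the noise is a constant offset, so the residual mapper only has to learn the correction rather than the full clean signal. That is the usual motivation for skip connections in enhancement, and the mapper that removes the offset drives both loss terms to zero.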
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
