Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement
Daniel Haider, Felix Perfler, Vincent Lostanlen, Martin Ehler, and Peter Balazs

TL;DR
This paper introduces a stable encoder-decoder architecture for speech enhancement that combines auditory filterbanks, frame theory, and spectral norms to improve training stability and speech quality.
Contribution
It proposes a hybrid approach integrating theory-driven and data-driven methods for training stable 1-D convolutional encoders in speech enhancement.
Findings
Significant improvement in PESQ (Perceptual Evaluation of Speech Quality) scores.
Enhanced stability in training 1-D convolutional encoders.
Effective integration of auditory filterbanks and frame theory.
Abstract
Convolutional layers with 1-D filters are often used as a frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via an auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives to the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual…
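The abstract's notion of "energy conservation" can be made concrete with frame bounds. The following is a minimal sketch, not the authors' objective: it assumes a stride-1 filterbank with circular convolution (a simplification; learned encoders usually use a larger stride), and the function names `dft`, `frame_bounds`, `condition_number`, and `circular_conv` are hypothetical. Under that assumption, the total output energy of the filterbank lies between A·‖x‖² and B·‖x‖², where A and B are the minimum and maximum over frequency of the summed squared filter responses. A ratio B/A close to 1 means the encoder is close to energy-preserving.

```python
import cmath


def dft(x, n):
    """Naive n-point DFT of a real sequence x (implicitly zero-padded to n)."""
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))
        for k in range(n)
    ]


def frame_bounds(filters, n):
    """Frame bounds (A, B) of a stride-1 circular convolutional filterbank.

    By Parseval, sum_j ||x * h_j||^2 = (1/n) * sum_k |X[k]|^2 * S[k],
    where S[k] = sum_j |H_j[k]|^2. Hence A = min_k S[k], B = max_k S[k].
    Assumes every filter has length <= n.
    """
    responses = [dft(h, n) for h in filters]
    s = [sum(abs(r[k]) ** 2 for r in responses) for k in range(n)]
    return min(s), max(s)


def condition_number(filters, n):
    """Ratio B/A; equals 1 for a tight (energy-preserving up to scale) frame."""
    a, b = frame_bounds(filters, n)
    return float("inf") if a == 0 else b / a


def circular_conv(x, h):
    """Circular convolution of signal x with a filter h, len(h) <= len(x)."""
    n = len(x)
    return [sum(h[m] * x[(t - m) % n] for m in range(len(h))) for t in range(n)]
```

As a usage example, `condition_number([[1.0]], 8)` describes the trivial identity filterbank, which is tight, while randomly initialized filters typically give a much larger ratio. A training penalty that pushes B/A toward 1 is one plausible way to encourage the stability the abstract describes, though the paper's exact formulation is not given here.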
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
