A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement
Bengt J. Borgstrom, Michael S. Brandstein

TL;DR
This paper introduces a multiscale autoencoder (MSAE) for mask-based, end-to-end neural speech enhancement. The MSAE decomposes the input waveform in separate band-limited branches, each operating at a different rate and scale, and is reported to outperform conventional single-branch autoencoders.
Contribution
It presents a multiscale autoencoder architecture with a flexible spectral band design based on the Constant-Q transform. The architecture is built entirely from differentiable operators, so it can be embedded in an end-to-end neural network and trained discriminatively.
Findings
Outperforms conventional single-branch autoencoders.
Achieves better speech quality metrics.
Improves automatic speech recognition accuracy.
Abstract
Neural network approaches to single-channel speech enhancement have received much recent attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating with a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. Additionally, the MSAE is constructed entirely of differentiable operators, allowing it to be implemented within an end-to-end neural network, and be discriminatively trained. The MSAE draws motivation both from…
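The abstract mentions a spectral band design based on the Constant-Q transform but does not give the layout itself. As a rough illustration of the constant-Q idea in general, and not the authors' actual design, the sketch below spaces band edges geometrically so that every band has the same ratio of center frequency to bandwidth. The function names, the frequency range, and the `bands_per_octave` parameter are all hypothetical.

```python
import math


def constant_q_band_edges(f_min, f_max, bands_per_octave):
    """Return geometrically spaced band edges in Hz, starting at f_min.

    Hypothetical helper: illustrates constant-Q spacing only; the MSAE
    paper's actual band layout is not specified in the abstract.
    """
    if not (0 < f_min < f_max) or bands_per_octave < 1:
        raise ValueError("need 0 < f_min < f_max and bands_per_octave >= 1")
    # Number of whole bands that fit between f_min and f_max.
    n_bands = math.floor(math.log2(f_max / f_min) * bands_per_octave + 1e-9)
    # Each edge is a fixed ratio 2**(1/bands_per_octave) above the previous one.
    return [f_min * 2.0 ** (k / bands_per_octave) for k in range(n_bands + 1)]


def band_q_factors(edges):
    """Q = center frequency / bandwidth for each pair of adjacent edges."""
    qs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        center = math.sqrt(lo * hi)  # geometric mean of the two band edges
        qs.append(center / (hi - lo))
    return qs


if __name__ == "__main__":
    edges = constant_q_band_edges(125.0, 8000.0, bands_per_octave=1)
    for (lo, hi), q in zip(zip(edges[:-1], edges[1:]), band_q_factors(edges)):
        print(f"{lo:8.1f} Hz to {hi:8.1f} Hz   Q = {q:.3f}")
```

With one band per octave, each band is twice as wide as the one below it, which is one plausible reason a multiscale system would process separate branches at different rates: low bands need less temporal resolution than high bands.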
Taxonomy
Topics: Speech and Audio Processing · Ultrasonics and Acoustic Wave Propagation · Speech Recognition and Synthesis
