Efficient Monaural Speech Enhancement using Spectrum Attention Fusion
Jinyu Long, Jetic Gū, Binhao Bai, Zhibo Yang, Ping Wei, and Junli Li

TL;DR
This paper introduces Spectrum Attention Fusion, a novel approach that reduces the complexity of Transformer-based speech enhancement models while maintaining or improving performance, making them more efficient for practical use.
Contribution
The paper proposes Spectrum Attention Fusion, a method that replaces multiple self-attention layers with a convolutional module to reduce model size and computational cost.
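To make the size argument concrete, the sketch below counts parameters for a stack of standard multi-head self-attention layers and for a single convolutional block. It is only an illustration under stated assumptions: the summary does not describe the actual convolutional module, so a depthwise-separable 1-D convolution is used as a stand-in, and `D_MODEL`, `KERNEL_SIZE`, and `LAYERS_REPLACED` are hypothetical values, not figures from the paper.

```python
def attention_params(d_model: int) -> int:
    """Parameters in one multi-head self-attention layer, biases included.

    Four d_model x d_model projections (query, key, value, output), each
    with a bias vector. The number of heads does not change the total.
    """
    return 4 * (d_model * d_model + d_model)


def separable_conv_params(channels: int, kernel_size: int) -> int:
    """Parameters in a depthwise-separable 1-D convolution, biases included.

    This is an assumed stand-in, not the module from the paper.
    """
    depthwise = channels * kernel_size + channels  # one filter per channel
    pointwise = channels * channels + channels     # 1x1 channel mixing
    return depthwise + pointwise


# Hypothetical sizes, chosen only for illustration.
D_MODEL = 64
KERNEL_SIZE = 3
LAYERS_REPLACED = 3

attn_total = LAYERS_REPLACED * attention_params(D_MODEL)
conv_total = separable_conv_params(D_MODEL, KERNEL_SIZE)

print(f"self-attention stack: {attn_total} parameters")
print(f"convolutional block:  {conv_total} parameters")
```

With these assumed sizes the attention stack holds roughly eleven times as many parameters as the convolutional block. The attention cost also grows quadratically with sequence length at inference time, which a fixed-kernel convolution avoids.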
Findings
Achieves comparable or better results than state-of-the-art models
Uses significantly fewer parameters (0.58M)
Maintains high speech enhancement quality
Abstract
Speech enhancement is a demanding task in automated speech processing pipelines, focusing on separating clean speech from noisy channels. Transformer-based models have recently bested RNN and CNN models in speech enhancement; however, they are much more computationally expensive and require much more high-quality training data, which is always hard to come by. In this paper, we present Spectrum Attention Fusion, an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity. We carefully construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to fuse spectral features more efficiently. Our proposed model is able to achieve comparable or better results against SOTA models but with significantly smaller…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings · Byte Pair Encoding
