Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes
Danilo de Oliveira, Tal Peer, Timo Gerkmann

TL;DR
This paper demonstrates that replacing learned encoder features with STFT magnitudes in a transformer-based speech enhancement model allows for longer input frames, reducing computational complexity significantly while maintaining high enhancement quality.
Contribution
The study introduces a method to use long frames in transformer-based speech enhancement by substituting learned features with STFT magnitudes, improving efficiency.
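As a rough illustration of the input representation described above, the following sketch computes STFT magnitudes with NumPy. The function name `stft_magnitude`, the Hann window, and the frame length and hop of 512 and 256 samples are assumptions chosen for the example; the summary does not state the authors' settings.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Return an array of shape (n_frames, frame_len // 2 + 1) of STFT magnitudes.

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Number of full frames that fit without padding.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping frames, one row per frame.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    # Real FFT of each windowed frame; keep the magnitude and drop the phase.
    return np.abs(np.fft.rfft(frames * window, axis=-1))

# Usage: one second of a 1 kHz tone at an assumed 16 kHz sampling rate.
t = np.arange(16_000) / 16_000
mag = stft_magnitude(np.sin(2 * np.pi * 1000 * t))
```

Each row of `mag` is one frame, so a longer `frame_len` and `hop` yield fewer rows for the same utterance.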
Findings
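Speech quality and intelligibility scores are equivalent when STFT magnitudes replace the learned-encoder features.
The number of operations is reduced by a factor of approximately 8 for a 10-second utterance.
Long frames can be used without compromising perceptual enhancement performance.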
Abstract
The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of frames at the input, which is problematic; since the SepFormer is transformer-based, its computational complexity drastically increases with longer sequences. In this paper, we employ the SepFormer in a speech enhancement task and show that by replacing the learned-encoder features with a magnitude short-time Fourier transform (STFT) representation, we can use long frames without compromising perceptual enhancement performance. We obtained equivalent quality and intelligibility evaluation scores while reducing the number of operations by a factor of approximately 8 for a 10-second utterance.
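To make the sequence-length argument concrete, the sketch below counts frames for a 10-second utterance under short and long framing and compares the number of frame pairs that a plain, full self-attention layer would score. The 16 kHz sampling rate and both frame configurations are hypothetical values picked for illustration, and the quadratic pair count is a simplification that does not model the SepFormer's actual architecture, so the resulting ratio is not expected to match the factor of approximately 8 reported in the abstract.

```python
def num_frames(num_samples, frame_len, hop):
    """Number of full frames that fit in the signal (no padding)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop

def attention_pairs(n_frames):
    """Frame pairs scored by one full self-attention layer (quadratic in length)."""
    return n_frames ** 2

SAMPLE_RATE = 16_000   # assumption, not stated in the summary
DURATION_S = 10        # utterance length used in the abstract
n_samples = SAMPLE_RATE * DURATION_S

# Hypothetical framing choices, in samples.
short_frames = num_frames(n_samples, frame_len=16, hop=8)
long_frames = num_frames(n_samples, frame_len=512, hop=256)

print("short frames:", short_frames)
print("long frames: ", long_frames)
print("pair-count ratio:", attention_pairs(short_frames) / attention_pairs(long_frames))
```

With these assumed settings the short framing produces 19999 frames and the long framing 624, which shows how quickly the sequence length shrinks as frames grow.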
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Ultrasonics and Acoustic Wave Propagation
Methods: Attention Is All You Need · Dense Connections · Parameterized ReLU · Softmax · Residual Connection · Layer Normalization · Position-Wise Feed-Forward Layer · Multi-Head Attention · Linear Layer
