Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes

Danilo de Oliveira; Tal Peer; Timo Gerkmann

arXiv:2206.11703·eess.AS·June 6, 2023

TL;DR

This paper shows that replacing the learned-encoder features of a transformer-based speech enhancement model (the SepFormer) with STFT magnitudes permits long input frames. The number of operations falls by a factor of approximately 8 for a 10-second utterance, while quality and intelligibility scores remain equivalent.

Contribution

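The study shows that the SepFormer can operate on long frames in a speech enhancement task when its learned-encoder features are replaced with a magnitude STFT representation. Longer frames mean fewer frames at the input, which lowers the computational cost of the transformer.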
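To make the phrase "STFT magnitudes" concrete, here is a minimal sketch of a magnitude STFT in plain Python. It is an illustration under stated assumptions, not the authors' front end: the helper names (`hann`, `stft_magnitude`), the Hann window, the 64-sample frame, and the 50% overlap are all hypothetical choices made for a small, readable example.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, frame_len, hop):
    """Naive DFT-based magnitude STFT.

    Returns one list of frame_len // 2 + 1 magnitudes per frame.
    Written for clarity, not speed (O(frame_len^2) per frame).
    """
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Window the current frame.
        segment = [signal[start + i] * window[i] for i in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                segment[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            # Keep only the magnitude; the phase is discarded.
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Toy input (assumption): a sinusoid centred exactly on DFT bin 4.
FRAME_LEN, HOP = 64, 32
signal = [math.sin(2 * math.pi * 4 * n / FRAME_LEN) for n in range(256)]
mags = stft_magnitude(signal, FRAME_LEN, HOP)
peak_bin = max(range(len(mags[0])), key=lambda k: mags[0][k])
print(len(mags), len(mags[0]), peak_bin)
```

Each frame becomes a vector of non-negative magnitudes, and a longer `frame_len` with a proportionally longer `hop` yields fewer such vectors for the same signal. In practice a library routine based on the FFT would replace the inner loops.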

Findings

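01. Speech quality and intelligibility scores are equivalent to those obtained with learned-encoder features.

02. The number of operations is reduced by a factor of approximately 8 for a 10-second utterance.

03. Long frames can be used without compromising perceptual enhancement performance.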

Abstract

The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of frames at the input, which is problematic; since the SepFormer is transformer-based, its computational complexity drastically increases with longer sequences. In this paper, we employ the SepFormer in a speech enhancement task and show that by replacing the learned-encoder features with a magnitude short-time Fourier transform (STFT) representation, we can use long frames without compromising perceptual enhancement performance. We obtained equivalent quality and intelligibility evaluation scores while reducing the number of operations by a factor of approximately 8 for a 10-second utterance.
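The link between frame length and sequence length can be sketched with a short calculation. The numbers below are assumptions chosen for illustration (16 kHz sampling, 50% overlap, 2 ms versus 32 ms frames) and are not the paper's configuration. The quadratic count models plain global self-attention only. The SepFormer processes its input in chunks, so this toy count does not reproduce the factor of approximately 8 that the abstract reports for the full model.

```python
def num_frames(num_samples, frame_len, hop):
    """Number of full frames when sliding a frame_len window with stride hop."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop


def attention_pairs(seq_len):
    """Entries in the score matrix of plain (global) self-attention."""
    return seq_len * seq_len


SAMPLE_RATE = 16_000   # assumed sampling rate in Hz
DURATION_S = 10        # utterance length mentioned in the abstract
num_samples = SAMPLE_RATE * DURATION_S

# Hypothetical frame settings with 50% overlap.
short_frames = num_frames(num_samples, frame_len=32, hop=16)    # 2 ms frames
long_frames = num_frames(num_samples, frame_len=512, hop=256)   # 32 ms frames

print("short frames:", short_frames)
print("long frames: ", long_frames)
print("sequence length ratio:", round(short_frames / long_frames, 1))
print("global attention ratio:",
      round(attention_pairs(short_frames) / attention_pairs(long_frames), 1))
```

Under these assumptions the sequence shrinks by roughly the ratio of the hop sizes, and the cost of any component that scales quadratically with sequence length shrinks by roughly the square of that ratio.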

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Ultrasonics and Acoustic Wave Propagation

Methods: Attention Is All You Need · Dense Connections · Parameterized ReLU · Softmax · Residual Connection · Layer Normalization · Position-Wise Feed-Forward Layer · Multi-Head Attention · Linear Layer