Exploring Length Generalization For Transformer-based Speech Enhancement

Qiquan Zhang, Hongxu Zhu, Xinyuan Qian, Eliathamby Ambikairajah, and Haizhou Li

arXiv:2506.06697 · eess.AS · June 10, 2025

Open Access

TL;DR

This paper investigates how Transformer-based speech enhancement models can generalize to longer speech utterances than those seen during training, focusing on the role of positional encoding methods.

Contribution

The study introduces a simple learnable positional encoding scheme, LearnLin, which improves length generalization in Transformer speech enhancement models.

Findings

01. Relative positional encoding outperforms absolute encoding for length generalization.

02. LearnLin achieves superior length generalization with comparable or better performance.

03. Positional encoding significantly influences the model's ability to handle longer utterances.
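
One common intuition for why relative schemes can extend to unseen lengths is that the position term depends only on the offset between two frames, not on their absolute indices. The sketch below illustrates this with a generic linear distance penalty added to attention scores. It is not LearnLin itself, whose exact form is not given on this page; the function names and the fixed `slope` value are hypothetical (a learnable variant would train the slope).

```python
import math

def linear_relative_bias(length, slope=0.5):
    """Bias b[i][j] = -slope * |i - j|; `slope` is an assumed, illustrative value."""
    return [[-slope * abs(i - j) for j in range(length)] for i in range(length)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, slope=0.5):
    """Add the relative bias to raw scores, then normalise each row."""
    n = len(scores)
    bias = linear_relative_bias(n, slope)
    return [softmax([scores[i][j] + bias[i][j] for j in range(n)]) for i in range(n)]

# The bias for a given offset is the same at any sequence length,
# so nothing new has to be learned for positions beyond the training length.
short_bias = linear_relative_bias(4)
long_bias = linear_relative_bias(16)
weights = attention_weights([[0.0] * 6 for _ in range(6)])
```

With uniform raw scores, each row of `weights` peaks at the query's own frame and decays with distance, regardless of how long the sequence is.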

Abstract

Transformer network architecture has proven effective in speech enhancement. However, as its core module, self-attention suffers from quadratic complexity, making it infeasible for training on long speech utterances. In practical scenarios, speech enhancement models are often required to perform on noisy speech at run-time that is substantially longer than the training utterances. It remains a challenge how a Transformer-based speech enhancement model can generalize to long speech utterances. In this paper, extensive empirical studies are conducted to explore the model's length generalization ability. In particular, we conduct speech enhancement experiments on four training objectives and evaluate with five metrics. Our studies establish that positional encoding is an effective instrument to dampen the effect of utterance length on speech enhancement. We first explore several existing…
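
To make the quadratic-complexity point in the abstract concrete, the following sketch counts the entries in one head's attention score matrix as utterance duration grows. The frame rate of 100 frames per second is an assumption chosen for illustration, not a value taken from the paper, and the helper names are hypothetical.

```python
def num_frames(duration_s, frames_per_second=100):
    """Frames in an utterance; 100 frames/s is an assumed rate, not from the paper."""
    return int(duration_s * frames_per_second)

def attention_score_entries(duration_s, frames_per_second=100):
    """Entries in one head's T x T attention score matrix."""
    t = num_frames(duration_s, frames_per_second)
    return t * t

# Doubling the duration quadruples the number of score entries.
for seconds in (2, 4, 8, 16):
    print(seconds, "s ->", attention_score_entries(seconds), "score entries per head")
```

Under this assumption, training on short clips keeps the score matrix small, which is why models typically meet much longer inputs only at run-time.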

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis

Methods: Softmax · Attention Is All You Need