Exploring Length Generalization For Transformer-based Speech Enhancement
Qiquan Zhang, Hongxu Zhu, Xinyuan Qian, Eliathamby Ambikairajah, and Haizhou Li

TL;DR
This paper investigates how Transformer-based speech enhancement models can generalize to longer speech utterances than those seen during training, focusing on the role of positional encoding methods.
Contribution
The study introduces a simple learnable positional encoding scheme, LearnLin, which improves length generalization in Transformer speech enhancement models.
Findings
Relative positional encoding outperforms absolute encoding for length generalization.
LearnLin achieves superior length generalization with comparable or better performance.
Positional encoding significantly influences the model's ability to handle longer utterances.
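The findings above turn on how position information enters self-attention. As a rough illustration only, the sketch below adds a bias that depends solely on the distance |i - j| to the attention scores before the softmax. This summary does not give LearnLin's formulation, so the linear form and the single `slope` value (fixed here, where a learnable scheme would train it) are assumptions for illustration, not the authors' method; all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def linear_relative_bias(length, slope):
    """Bias b[i][j] = -slope * |i - j| (assumed linear form, for illustration)."""
    return [[-slope * abs(i - j) for j in range(length)] for i in range(length)]


def attention_weights(scores, slope):
    """Add the relative bias to raw attention scores, then softmax each row."""
    n = len(scores)
    bias = linear_relative_bias(n, slope)
    return [
        softmax([scores[i][j] + bias[i][j] for j in range(n)])
        for i in range(n)
    ]


# The same slope is reused at two lengths; nothing is tied to absolute index.
short_w = attention_weights([[0.0] * 4 for _ in range(4)], slope=0.5)
long_w = attention_weights([[0.0] * 16 for _ in range(16)], slope=0.5)
```

Because the bias is a function of distance rather than absolute index, the same slope applies unchanged at lengths never seen in training, which is one intuition for why relative schemes may extend to longer utterances.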
Abstract
Transformer network architecture has proven effective in speech enhancement. However, as its core module, self-attention suffers from quadratic complexity, making it infeasible for training on long speech utterances. In practical scenarios, speech enhancement models are often required to perform on noisy speech at run-time that is substantially longer than the training utterances. It remains a challenge how a Transformer-based speech enhancement model can generalize to long speech utterances. In this paper, extensive empirical studies are conducted to explore the model's length generalization ability. In particular, we conduct speech enhancement experiments on four training objectives and evaluate with five metrics. Our studies establish that positional encoding is an effective instrument to dampen the effect of utterance length on speech enhancement. We first explore several existing…
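To make the quadratic cost mentioned in the abstract concrete, the sketch below counts the entries of one attention head's score matrix for two utterance lengths. The 10 ms frame shift and the 4 s and 40 s durations are hypothetical values chosen for easy arithmetic, not settings taken from the paper.

```python
def num_frames(duration_s, frame_shift_ms=10.0):
    """Number of frames, assuming a hypothetical fixed frame shift."""
    return round(duration_s * 1000.0 / frame_shift_ms)


def attention_score_entries(duration_s, frame_shift_ms=10.0):
    """Entries in one head's frames-by-frames attention score matrix."""
    n = num_frames(duration_s, frame_shift_ms)
    return n * n


train_entries = attention_score_entries(4.0)   # hypothetical training clip
test_entries = attention_score_entries(40.0)   # hypothetical run-time utterance
growth = test_entries // train_entries
```

Under these assumptions, a tenfold increase in duration yields a hundredfold increase in score-matrix entries (`growth` is 100), which is why training on long utterances quickly becomes impractical.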
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis
Methods: Softmax · Attention Is All You Need
