An Exploration of Length Generalization in Transformer-Based Speech Enhancement
Qiquan Zhang, Hongxu Zhu, Xinyuan Qian, Eliathamby Ambikairajah, Haizhou Li

TL;DR
This paper investigates how Transformer-based speech enhancement models can generalize across different utterance lengths, emphasizing the role of position embeddings, especially relative position embeddings, in improving length generalization.
Contribution
The paper systematically explores how different position embedding schemes affect length generalization in Transformer speech enhancement models, and highlights the effectiveness of relative position embeddings.
Findings
Relative position embeddings outperform absolute position embeddings in length generalization.
Position embeddings significantly alleviate the impact of utterance length on model performance.
The study provides practical insights for designing more robust Transformer-based speech enhancement systems.
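One intuition for why relative schemes may extend to longer inputs can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes a clipped relative-offset bias table and a learned absolute position table, and the names `relative_bias`, `absolute_lookup`, `BIAS_TABLE`, and `POSITION_TABLE`, as well as all numeric values, are hypothetical. The summary does not say which four schemes the paper compares, so neither function should be read as one of them.

```python
def relative_bias(seq_len, bias_table, max_distance):
    """Build a seq_len x seq_len attention bias where entry [i][j]
    depends only on the offset j - i, clipped to [-max_distance, max_distance]."""
    return [
        [bias_table[max(-max_distance, min(max_distance, j - i)) + max_distance]
         for j in range(seq_len)]
        for i in range(seq_len)
    ]


def absolute_lookup(seq_len, position_table):
    """Return one learned vector per absolute position.
    Positions beyond the table size were never trained, so the lookup fails."""
    if seq_len > len(position_table):
        raise IndexError("sequence longer than the learned absolute position table")
    return position_table[:seq_len]


# Hypothetical parameters (assumptions for illustration only).
MAX_DISTANCE = 2
BIAS_TABLE = [0.1, 0.2, 0.3, 0.4, 0.5]          # one value per offset in -2..2
POSITION_TABLE = [[0.0], [1.0], [2.0], [3.0]]   # "trained" on length 4 only

short_bias = relative_bias(4, BIAS_TABLE, MAX_DISTANCE)    # training-length input
long_bias = relative_bias(10, BIAS_TABLE, MAX_DISTANCE)    # longer test input

# The same offset (+1) maps to the same parameter at any position and any length.
print(short_bias[0][1] == long_bias[7][8])

try:
    absolute_lookup(10, POSITION_TABLE)
except IndexError as err:
    print("absolute table:", err)
```

In this sketch the relative bias reuses the same small table at every sequence length, while the learned absolute table has no entry for positions it never saw. Absolute schemes that are computed rather than learned (sinusoidal embeddings, for example) do not hit this hard failure, so the comparison only illustrates the indexing difference and says nothing about enhancement quality.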
Abstract
The use of Transformer architectures has facilitated remarkable progress in speech enhancement. Training Transformers using substantially long speech utterances is often infeasible as self-attention suffers from quadratic complexity. It is a critical and unexplored challenge for a Transformer-based speech enhancement model to learn from short speech utterances and generalize to longer ones. In this paper, we conduct comprehensive experiments to explore the length generalization problem in speech enhancement with Transformer. Our findings first establish that position embedding provides an effective instrument to alleviate the impact of utterance length on Transformer-based speech enhancement. Specifically, we explore four different position embedding schemes to enable length generalization. The results confirm the superiority of relative position embeddings (RPEs) over absolute PE…
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques
Methods: Linear Layer · Multi-Head Attention · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam
