Attention does not guarantee best performance in speech enhancement
Zhongshu Hou, Qinwen Hu, Kai Chen, Jing Lu

TL;DR
This paper examines whether attention mechanisms actually help speech enhancement and argues that global attention may not outperform RNNs, because nearby context matters more than distant context in speech signals.
Contribution
The study tests the assumption that attention always improves speech enhancement by replacing attention with RNNs in two state-of-the-art models, MTFAA and CMGAN.
Findings
Replacing attention with RNNs does not degrade performance in the tested models.
Local information may be more important than long-term dependencies in speech enhancement.
Attention mechanisms are not universally superior in speech enhancement tasks.
Abstract
The attention mechanism has been widely used in speech enhancement (SE) because, in theory, it can effectively model the long-term inherent connections of a signal in both the time domain and the spectral domain. However, the commonly used global attention mechanism might not be the best choice, since adjacent information naturally has more influence than far-apart information in speech enhancement. In this paper, we validate this conjecture by replacing attention with an RNN in two typical state-of-the-art (SOTA) models: the multi-scale temporal frequency convolutional network (MTFAA) with axial attention, and the conformer-based metric-GAN network (CMGAN).
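The locality argument in the abstract can be illustrated with a toy comparison. The sketch below is not the paper's code and does not reproduce MTFAA or CMGAN. It assumes scalar per-frame features, and it uses a fixed-weight leaky linear recurrence as a hypothetical stand-in for a trained RNN layer. All function names are illustrative. It shows that global attention weights frames by content regardless of distance, while the recurrence weights past frames by a factor that shrinks geometrically with distance.

```python
import math


def global_attention(frames, query_index):
    """Dot-product attention of one query frame over ALL frames (scalar features).

    Returns the context value and the attention weights. The weights depend
    only on content similarity, not on how far a frame is from the query.
    """
    q = frames[query_index]
    scores = [q * k for k in frames]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = sum(w * v for w, v in zip(weights, frames))
    return context, weights


def leaky_recurrence(frames, decay):
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t with h_{-1} = 0.

    A toy linear recurrence (an assumption for illustration), returning the
    final state after the last frame.
    """
    h = 0.0
    for x in frames:
        h = decay * h + (1.0 - decay) * x
    return h


def leaky_recurrence_weights(num_frames, decay):
    """Effective weight of each frame in the final state of leaky_recurrence.

    Unrolling the recurrence gives weight (1 - decay) * decay**distance,
    where distance is the number of steps back from the last frame.
    """
    last = num_frames - 1
    return [(1.0 - decay) * decay ** (last - t) for t in range(num_frames)]


if __name__ == "__main__":
    frames = [0.2, -0.1, 0.4, 0.3, -0.5]  # made-up frame features
    _, attn = global_attention(frames, query_index=len(frames) - 1)
    rec = leaky_recurrence_weights(len(frames), decay=0.5)
    print("attention weights: ", [round(w, 3) for w in attn])
    print("recurrence weights:", [round(w, 3) for w in rec])
```

In this toy setting the recurrence builds in the prior that adjacent frames matter most, whereas attention has to learn any such preference from data. Whether that prior helps in practice is the empirical question the paper studies.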
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Indoor and Outdoor Localization Technologies
