An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
Ryan Whetten, Titouan Parcollet, Adel Moumen, Marco Dinarelli, Yannick Estève

TL;DR
This paper investigates replacing quadratic-complexity self-attention in SSL speech models with linear-complexity alternatives, reducing VRAM consumption by roughly 20% to 60% and increasing speed by 7% to 65% while maintaining competitive performance.
Contribution
The paper systematically evaluates four recent linear-complexity substitutes for MHSA (HyperMixing, Fastformer, SummaryMixing, and Mamba) in SSL speech models, comparing their speed, VRAM consumption, and MP3S benchmark performance against MHSA.
Findings
The linear-complexity alternatives reduce VRAM consumption by around 20% to 60% on average.
Speed increases range from 7% to 65%.
Performance on the MP3S benchmark remains competitive with MHSA.
Abstract
Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due to the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input…
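To make the quadratic-versus-linear distinction in the abstract concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: `self_attention` is a single-head attention with no learned projections, and `mean_summary_mixing` is a hypothetical, heavily simplified mixer loosely in the spirit of SummaryMixing (each frame is paired with one mean vector over all frames, with no learned transforms). Both function names and the returned interaction counts are illustrative assumptions.

```python
import math


def self_attention(x):
    """Single-head self-attention without learned projections (a simplification).

    x is a list of T frames, each a list of D floats. Every frame is scored
    against every other frame, so T * T scores are computed in total.
    """
    T, D = len(x), len(x[0])
    scale = 1.0 / math.sqrt(D)
    out = []
    for i in range(T):
        # One row of the T x T score matrix.
        scores = [scale * sum(a * b for a, b in zip(x[i], x[j])) for j in range(T)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append(
            [sum(w * x[j][d] for j, w in enumerate(weights)) / norm for d in range(D)]
        )
    return out, T * T  # output frames, number of pairwise scores


def mean_summary_mixing(x):
    """Hypothetical linear-time mixer (assumption, not the paper's code).

    One pass over the T frames builds a single mean "summary" vector, which is
    then concatenated onto every frame. No T x T matrix is ever formed.
    """
    T, D = len(x), len(x[0])
    summary = [sum(x[t][d] for t in range(T)) / T for d in range(D)]
    out = [x[t] + summary for t in range(T)]  # list concatenation: 2 * D values
    return out, T  # output frames, number of frames visited for the summary


if __name__ == "__main__":
    for T in (4, 8, 16):
        frames = [[float(t), 1.0] for t in range(T)]
        _, pairwise = self_attention(frames)
        _, visited = mean_summary_mixing(frames)
        print(T, pairwise, visited)
```

Doubling the sequence length quadruples the number of pairwise scores in the attention sketch but only doubles the work in the summary-based one, which is the scaling behavior that motivates the substitutes studied here.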
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Fastformer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
