Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen

TL;DR
This paper introduces the Polynomial Mixer (PoM), a token-mixing mechanism that replaces multi-head self-attention in speech encoders. PoM reaches a competitive word error rate while scaling linearly, rather than quadratically, with input sequence length.
Contribution
The paper presents Polynomial Mixer (PoM), a new linear-complexity token-mixing method for self-supervised speech encoders, improving efficiency while maintaining competitive accuracy.
Findings
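PoM achieves a word error rate competitive with full self-attention and with other linear-complexity alternatives.
PoM scales linearly with input sequence length, whereas self-attention is quadratic in both memory and computation.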
PoM offers an improved efficiency-performance trade-off.
Abstract
State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between…
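The abstract does not spell out how PoM is computed, so the following is only a minimal sketch of how a token mixer can stay linear in sequence length. It assumes one plausible scheme: project each token, average elementwise powers of the projections over the sequence to get a shared polynomial summary, then let each token read that summary through its own sigmoid gate. The function `polynomial_mix`, the weights `w_poly` and `w_gate`, and the `degree` parameter are hypothetical names chosen for illustration, not the paper's formulation.

```python
import math
import random


def matmul(x, w):
    """Multiply an (n x d) matrix by a (d x m) matrix, both lists of rows."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def polynomial_mix(x, w_poly, w_gate, degree=2):
    """Toy linear-time token mixer; an assumption, not the paper's exact PoM.

    x:      n tokens, each a list of d features
    w_poly: (d x m) projection applied before taking powers
    w_gate: (d x degree*m) projection producing per-token gates
    """
    n = len(x)
    h = matmul(x, w_poly)  # (n x m), one pass over the tokens
    m = len(h[0])
    # Sequence-level summary: mean over tokens of h**k for k = 1..degree.
    pooled = [
        sum(row[j] ** k for row in h) / n
        for k in range(1, degree + 1)
        for j in range(m)
    ]
    # Each token reads the shared summary through its own sigmoid gate.
    gates = matmul(x, w_gate)  # (n x degree*m), a second pass over the tokens
    return [
        [p / (1.0 + math.exp(-g)) for g, p in zip(row, pooled)]
        for row in gates
    ]


# Hypothetical toy sizes: 6 tokens, 4 features, 3 projected features, degree 2.
random.seed(0)
n, d, m, degree = 6, 4, 3, 2
x = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
w_poly = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(d)]
w_gate = [[random.uniform(-1, 1) for _ in range(degree * m)] for _ in range(d)]
y = polynomial_mix(x, w_poly, w_gate, degree)
print(len(y), len(y[0]))
```

In this sketch no n-by-n matrix is ever formed: the work is two passes over the tokens, so the cost grows in proportion to n, whereas self-attention compares every token pair.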
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
