Polynomial Mixing for Efficient Self-supervised Speech Encoders

Eva Feillet; Ryan Whetten; David Picard; Alexandre Allauzen

arXiv:2603.00683 · cs.CL · March 3, 2026

Open Access

TL;DR

This paper introduces the Polynomial Mixer (PoM), a token-mixing mechanism that replaces multi-head self-attention in speech encoders. PoM scales linearly rather than quadratically with input sequence length while reaching a word error rate competitive with full self-attention, easing the scalability constraints of Transformer-based speech recognition models.

Contribution

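The paper presents the Polynomial Mixer (PoM), a token-mixing method with linear complexity in the input sequence length. PoM is integrated into a BEST-RQ self-supervised speech encoder as a drop-in replacement for multi-head self-attention, improving efficiency while maintaining competitive accuracy.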

Findings

01 PoM achieves a word error rate competitive with full self-attention and with other linear-complexity alternatives.

02 PoM lowers memory and computation costs from quadratic to linear in the input sequence length.

03 PoM offers an improved efficiency-performance trade-off.

Abstract

State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between…
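
The abstract does not spell out PoM's formula, so the following is only an illustrative sketch of how a polynomial token mixer can stay linear in sequence length. It assumes (these are assumptions, not details from this page) that monomials are built as element-wise products of linear projections, that they are mean-pooled over tokens into one shared summary, and that each token reads the summary through a sigmoid gate. All names (`polynomial_mixer`, `W_proj`, `W_sel`, `W_out`, `make_example`) are hypothetical, and activations, multiple heads, and normalization are omitted.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights standing in for learned ones."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def polynomial_mixer(X, W_proj, W_sel, W_out):
    """Simplified polynomial token mixer (assumed form, not the paper's exact one).

    X      : n token vectors of size d
    W_proj : k matrices of shape (D, d), one per polynomial degree
    W_sel  : gating matrix of shape (k*D, d)
    W_out  : output matrix of shape (d, k*D)
    """
    n, D = len(X), len(W_proj[0])
    summary = [0.0] * (len(W_proj) * D)

    # Pass 1: accumulate a fixed-size summary of the whole sequence.
    # Cost grows linearly with n; no n-by-n matrix is ever formed.
    for x in X:
        running = [1.0] * D
        for m, W in enumerate(W_proj):
            # Multiplying in one more projection raises the degree by one.
            running = [r * h for r, h in zip(running, matvec(W, x))]
            for j, v in enumerate(running):
                summary[m * D + j] += v / n  # mean over tokens

    # Pass 2: every token gates the shared summary, again linear in n.
    out = []
    for x in X:
        gate = [1.0 / (1.0 + math.exp(-z)) for z in matvec(W_sel, x)]
        out.append(matvec(W_out, [g * s for g, s in zip(gate, summary)]))
    return out


def make_example(n=6, d=4, D=3, k=2, seed=0):
    """Build toy inputs and weights with a fixed seed."""
    rng = random.Random(seed)
    X = random_matrix(n, d, rng)
    W_proj = [random_matrix(D, d, rng) for _ in range(k)]
    W_sel = random_matrix(k * D, d, rng)
    W_out = random_matrix(d, k * D, rng)
    return X, W_proj, W_sel, W_out
```

Calling `polynomial_mixer(*make_example())` returns one output vector per input token. In this sketch the shared summary has `k * D` entries regardless of `n`, so doubling the sequence length doubles the work, whereas self-attention's pairwise score matrix would quadruple.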

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling