SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision

Maxime Poli; Mahi Luthra; Youssef Benchekroun; Yosuke Higuchi; Martin Gleize; Jiayi Shen; Robin Algayres; Yu-An Chung; Mido Assran; Juan Pino; Emmanuel Dupoux

arXiv:2512.20308 · cs.CL · December 29, 2025

TL;DR

SpidR is a self-supervised speech representation model that learns stable, high-quality linguistic units directly from raw speech, enabling faster pretraining and improved language modeling without textual supervision.

Contribution

The paper introduces SpidR, a self-supervised model that stabilizes online clustering of speech units, shortens pretraining, and outperforms existing models on language modeling benchmarks.

Findings

01 SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on key benchmarks.

02 SpidR's speech units correlate well with language modeling performance.

03 Pretraining time is reduced significantly, to one day on 16 GPUs.

Abstract

The parallel advances in language modeling and speech representation learning have raised the prospect of learning language directly from speech without textual intermediates. This requires extracting semantic representations directly from speech. Our contributions are threefold. First, we introduce SpidR, a self-supervised speech representation model that efficiently learns representations with highly accessible phonetic information, which makes it particularly suited for textless spoken language modeling. It is trained on raw waveforms using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher's intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher quality codebooks.…
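The training recipe described in the abstract (masked prediction, self-distillation, online clustering) can be illustrated with a toy sketch for a single layer. This is not the authors' implementation: the function names, the nearest-neighbour assignment rule, the moving-average codebook update, and the decay values are all assumptions made for illustration, and a real model would repeat this for each intermediate layer, with its own codebook, on learned features.

```python
import math

def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    return dists.index(min(dists))

def ema_update(teacher, student, decay):
    """Self-distillation: teacher weights follow a moving average of the student's."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def masked_prediction_loss(student_logits, teacher_feats, codebook, masked):
    """Mean cross-entropy over masked frames; targets are the teacher's code assignments."""
    losses = []
    for t in masked:
        target = nearest_code(teacher_feats[t], codebook)
        losses.append(cross_entropy(student_logits[t], target))
    return sum(losses) / len(losses)

def update_codebook(codebook, teacher_feats, decay):
    """Online clustering (assumed rule): move each code toward the mean of its assigned frames."""
    new_codebook = []
    for k, code in enumerate(codebook):
        members = [f for f in teacher_feats if nearest_code(f, codebook) == k]
        if not members:
            new_codebook.append(list(code))  # unused codes stay where they are
            continue
        mean = [sum(dim) / len(members) for dim in zip(*members)]
        new_codebook.append([decay * c + (1.0 - decay) * m for c, m in zip(code, mean)])
    return new_codebook

# Toy data: two frames, two codes, both frames masked (all values hypothetical).
codebook = [[0.0, 0.0], [1.0, 1.0]]
teacher_feats = [[0.1, 0.0], [0.9, 1.0]]
student_logits = [[5.0, 0.0], [0.0, 5.0]]
loss = masked_prediction_loss(student_logits, teacher_feats, codebook, masked=[0, 1])
codebook = update_codebook(codebook, teacher_feats, decay=0.9)
print(loss, codebook)
```

In this sketch the student is trained only through the cross-entropy term, while the teacher and the codebook change only through moving averages, so the prediction targets drift slowly rather than being optimized directly.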

Code & Models

Models

iliasslasri/robust_speech_quantizer (model, 12 dl)

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques