LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization
Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim

TL;DR
LM-SPT introduces a semantic distillation approach for speech tokenization that reconstructs speech from semantic tokens, producing units better aligned with language models and improving efficiency and downstream task performance.
Contribution
The paper proposes LM-SPT, a new speech tokenization method that uses semantic reconstruction and architectural enhancements to produce more semantically aligned discrete speech units.
Findings
Achieves higher reconstruction fidelity than baselines.
SLMs trained with LM-SPT tokens perform well on speech-to-text tasks.
Outperforms baselines on text-to-speech tasks.
Abstract
With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for…
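The frame-rate reduction the abstract mentions can be pictured with a small sketch. The snippet below illustrates only the rigid average pooling baseline that the abstract describes as problematic, not the LM-SPT method itself. The function name `average_pool_frames`, the pooling factor, and the toy feature values are assumptions made for illustration and do not come from the paper.

```python
from typing import List


def average_pool_frames(frames: List[List[float]], factor: int) -> List[List[float]]:
    """Reduce the frame rate by averaging each run of `factor` consecutive frames.

    `frames` is a list of per-frame feature vectors. Trailing frames that do
    not fill a complete window are dropped (an assumption of this sketch).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    # Step through the sequence in non-overlapping windows of `factor` frames.
    for start in range(0, len(frames) - factor + 1, factor):
        window = frames[start:start + factor]
        # zip(*window) groups values by feature dimension across the window.
        pooled.append([sum(column) / factor for column in zip(*window)])
    return pooled


# Toy input: 9 frames with 2 features each (hypothetical values).
frames = [[float(2 * i), float(2 * i + 1)] for i in range(9)]
pooled = average_pool_frames(frames, factor=4)
print(len(frames), "->", len(pooled), "frames")
print(pooled)
```

Because every window is averaged with equal weight regardless of content, frames belonging to different units can be blended into a single vector, which is the kind of distortion or dilution the abstract attributes to rigid pooling.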
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: ALIGN
