LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

Daejin Jo; Jeeyoung Yun; Byungseok Roh; Sungwoong Kim

arXiv:2506.16738·cs.CL·June 23, 2025

TL;DR

LM-SPT introduces a novel semantic distillation approach for speech tokenization, reconstructing speech from semantic tokens to produce units better aligned with language models, improving efficiency and downstream task performance.

Contribution

The paper proposes LM-SPT, a new speech tokenization method that uses semantic reconstruction and architectural enhancements to produce more semantically aligned discrete speech units.

Findings

01 Achieves higher reconstruction fidelity than baselines.

02 SLMs trained with LM-SPT tokens perform well on speech-to-text tasks.

03 Outperforms baselines on text-to-speech tasks.

Abstract

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for…
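
The fixed-window average pooling that the abstract names as a standard way to reduce frame rate can be sketched as follows. This is a minimal illustration of that baseline technique only, not of LM-SPT itself; the function name `average_pool_frames`, the pooling factor, and the toy feature values are assumptions introduced for the example.

```python
from typing import List


def average_pool_frames(frames: List[List[float]], factor: int) -> List[List[float]]:
    """Reduce the frame rate by averaging each run of `factor` consecutive frames.

    A trailing window with fewer than `factor` frames is averaged over the
    frames it actually contains.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]
        dim = len(window[0])
        # Element-wise mean over the frames in this window.
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


# Hypothetical toy input: 4 frames of 2-dimensional features.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
pooled = average_pool_frames(frames, factor=2)  # halves the frame rate
print(pooled)
```

Because every window has the same fixed length regardless of content, frames belonging to different speech units can be averaged together, which is the kind of dilution the abstract refers to.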

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: ALIGN