Speech Codec Probing from Semantic and Phonetic Perspectives
Xuan Shi, Chang Zeng, Tiantian Feng, Shih-Heng Wang, Jianbo Ma, Shrikanth Narayanan

TL;DR
This paper systematically analyzes widely used speech tokenizers and finds that they mainly encode phonetic rather than semantic information, a bias that can degrade multimodal LLM performance and that should inform future tokenizer design.
Contribution
It provides a detailed analysis of existing speech tokenizers' semantic and phonetic encoding, highlighting their phonetic bias and implications for improving speech representations.
Findings
Current tokenizers mainly capture phonetic information.
What is termed "semantic" content in speech tokenizers does not align with text-derived semantics.
The analysis yields practical implications for designing better speech tokenization methods.
Abstract
Speech tokenizers are essential for connecting speech to large language models (LLMs) in multimodal systems. These tokenizers are expected to preserve both semantic and acoustic information for downstream understanding and generation. However, emerging evidence suggests that what is termed "semantic" in speech representations does not align with text-derived semantics: a mismatch that can degrade multimodal LLM performance. In this paper, we systematically analyze the information encoded by several widely used speech tokenizers, disentangling their semantic and phonetic content through word-level probing tasks, layerwise representation analysis, and cross-modal alignment metrics such as CKA. Our results show that current tokenizers primarily capture phonetic rather than lexical-semantic structure, and we derive practical implications for the design of next-generation speech tokenization…
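The abstract names CKA (centered kernel alignment) as one of its cross-modal alignment metrics but does not say which variant is used. As a rough illustration only, the sketch below assumes the linear variant and compares two sets of representations computed for the same items, for example speech-token embeddings and text embeddings of the same words. The function names and the toy matrices are hypothetical and are not taken from the paper.

```python
import math


def center_columns(X):
    """Subtract the per-column mean so each feature has zero mean."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def cross_cov_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-aligned matrices X and Y."""
    n = len(X)
    total = 0.0
    for a in range(len(X[0])):
        for b in range(len(Y[0])):
            s = sum(X[i][a] * Y[i][b] for i in range(n))
            total += s * s
    return total


def linear_cka(X, Y):
    """Linear CKA between two representations of the same n items.

    X is n x d1 and Y is n x d2 (lists of rows); row i of X and row i of Y
    must describe the same item. The score lies in [0, 1].
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of rows")
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_cov_sq_norm(Xc, Yc)
    denominator = math.sqrt(cross_cov_sq_norm(Xc, Xc) * cross_cov_sq_norm(Yc, Yc))
    return numerator / denominator


# Hypothetical toy data: 4 items with 2-dimensional features.
speech_repr = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [0.0, 1.0]]
# The same features rotated by 90 degrees; linear CKA ignores rotations.
rotated_repr = [[-y, x] for x, y in speech_repr]

cka_rotated = linear_cka(speech_repr, rotated_repr)
cka_unrelated = linear_cka([[1.0], [-1.0], [1.0], [-1.0]],
                           [[1.0], [1.0], [-1.0], [-1.0]])
```

A score near 1 means the two representations share the same geometry up to rotation and uniform scaling, while a score near 0 means they are linearly unrelated. Under this reading, a low CKA between tokenizer outputs and text embeddings would be consistent with the mismatch the abstract describes. For realistic feature sizes, an array library would be a more practical choice than pure Python lists.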
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems
