Speech Codec Probing from Semantic and Phonetic Perspectives

Xuan Shi; Chang Zeng; Tiantian Feng; Shih-Heng Wang; Jianbo Ma; Shrikanth Narayanan

arXiv:2603.10371 · eess.AS · March 12, 2026

Open Access

TL;DR

This paper systematically analyzes widely used speech tokenizers and finds that they mainly encode phonetic rather than lexical-semantic information. This mismatch with text-derived semantics can degrade multimodal LLM performance, and the findings inform future tokenizer design.

Contribution

The paper analyzes in detail how existing speech tokenizers encode semantic and phonetic information, highlighting their phonetic bias and its implications for improving speech representations.

Findings

01

Current tokenizers mainly capture phonetic information.

02

What is termed "semantic" in speech tokenizer representations does not align with text-derived semantics.

03

The analysis yields practical insights for designing better speech tokenization methods.

Abstract

Speech tokenizers are essential for connecting speech to large language models (LLMs) in multimodal systems. These tokenizers are expected to preserve both semantic and acoustic information for downstream understanding and generation. However, emerging evidence suggests that what is termed "semantic" in speech representations does not align with text-derived semantics: a mismatch that can degrade multimodal LLM performance. In this paper, we systematically analyze the information encoded by several widely used speech tokenizers, disentangling their semantic and phonetic content through word-level probing tasks, layerwise representation analysis, and cross-modal alignment metrics such as CKA. Our results show that current tokenizers primarily capture phonetic rather than lexical-semantic structure, and we derive practical implications for the design of next-generation speech tokenization…
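The abstract names CKA (centered kernel alignment) as one of its cross-modal alignment metrics. As a rough illustration only, the sketch below computes the standard linear form of CKA between two feature matrices with NumPy. It is not the authors' implementation: the function name `linear_cka`, the array shapes, and the random stand-in features are all assumptions made for this example, and the paper may use a different CKA variant or preprocessing.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with matching rows.

    x has shape (n_items, d_x) and y has shape (n_items, d_y); row i of
    each matrix must describe the same item (for example, the same word).
    """
    # Center every feature dimension so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-ins for word-level features; real inputs would be
# pooled speech-tokenizer embeddings and text embeddings of the same words.
rng = np.random.default_rng(0)
speech_feats = rng.normal(size=(200, 16))

# An orthogonal rotation of the same features: linear CKA is invariant to it.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated_feats = speech_feats @ rotation

# Independent random features of a different width.
unrelated_feats = rng.normal(size=(200, 32))

cka_same = linear_cka(speech_feats, rotated_feats)
cka_diff = linear_cka(speech_feats, unrelated_feats)
print(f"rotated copy: {cka_same:.3f}, unrelated: {cka_diff:.3f}")
```

Because linear CKA is invariant to orthogonal transforms and isotropic scaling, the two representation spaces can have different dimensionality and orientation. The score for independent features is not exactly zero at finite sample sizes, so comparisons are best read relative to such a baseline.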

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems