XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs
Yitian Gong, Luozhijie Jin, Ruifan Deng, Dong Zhang, Xin Zhang, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu

TL;DR
XY-Tokenizer introduces a multi-stage, multi-task learning speech codec that effectively balances semantic richness and acoustic fidelity, enabling high-quality reconstruction and semantic understanding at low bitrates.
Contribution
It presents XY-Tokenizer, a novel codec that mitigates semantic-acoustic conflicts, outperforming prior codecs in both semantic alignment and speaker similarity at low bitrates.
Findings
Achieves strong text alignment surpassing SpeechTokenizer and Mimi.
Maintains a speaker similarity score of 0.83, close to state-of-the-art acoustic codecs.
Comparable reconstruction performance to BigCodec at similar bitrates.
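Because the findings compare codecs "at similar bitrates", it may help to see how the bitrate of a residual-vector-quantization codec is usually computed: frame rate times number of quantizers times bits per code. The sketch below is only illustrative. The function name `codec_bitrate_bps` and the configuration values (12.5 Hz, 8 or 4 quantizers, 1024-entry codebooks) are assumptions chosen to make the arithmetic easy to follow, not figures taken from this page.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_quantizers: int, codebook_size: int) -> float:
    """Bits per second for a codec whose quantizers all use equal-size codebooks."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_quantizers * bits_per_code


# Hypothetical configuration, chosen only to illustrate the arithmetic.
full = codec_bitrate_bps(frame_rate_hz=12.5, num_quantizers=8, codebook_size=1024)

# Halving the quantizer depth halves the bitrate at the same frame rate.
shallow = codec_bitrate_bps(frame_rate_hz=12.5, num_quantizers=4, codebook_size=1024)

print(f"8 quantizers: {full / 1000:.2f} kbps")     # 8 quantizers: 1.00 kbps
print(f"4 quantizers: {shallow / 1000:.2f} kbps")  # 4 quantizers: 0.50 kbps
```

The second call shows why quantizer depth matters when comparing codecs: any multi-quantizer codec can be run at a lower bitrate by dropping quantizers, so comparisons are only meaningful when quality is reported at the matched bitrate.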
Abstract
Speech codecs serve as bridges between speech signals and large language models. An ideal codec for speech language models should not only preserve acoustic information but also capture rich semantic information. However, existing speech codecs struggle to balance high-quality audio reconstruction with ease of modeling by language models. In this study, we analyze the limitations of previous codecs in balancing semantic richness and acoustic fidelity. We propose XY-Tokenizer, a novel codec that mitigates the conflict between semantic and acoustic capabilities through multi-stage, multi-task learning. Experimental results demonstrate that XY-Tokenizer achieves performance in both semantic and acoustic tasks comparable to that of state-of-the-art codecs operating at similar bitrates, even though those existing codecs typically excel in only one aspect. Specifically, XY-Tokenizer achieves…
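The abstract's "multi-stage, multi-task learning" is described in the reviews on this page as ASR plus reconstruction pre-training followed by GAN-based post-training. A minimal sketch of how such a staged objective could be organized is shown here. The stage names, the loss-term names, the unit weights, and the assumption that the reconstruction term stays active during post-training are all hypothetical; the page does not give the authors' actual loss formulation.

```python
from typing import Dict

# Hypothetical per-stage loss weights; the source does not state any values.
STAGE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "pretrain": {"asr": 1.0, "reconstruction": 1.0},
    "post_train": {"reconstruction": 1.0, "adversarial": 1.0},
}


def stage_objective(stage: str, losses: Dict[str, float]) -> float:
    """Weighted sum of the loss terms that are active in the given stage."""
    weights = STAGE_WEIGHTS[stage]
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms for stage {stage!r}: {sorted(missing)}")
    return sum(weight * losses[name] for name, weight in weights.items())


# Made-up per-step loss values, used only to exercise the function.
step_losses = {"asr": 2.0, "reconstruction": 0.5, "adversarial": 0.25}
print(stage_objective("pretrain", step_losses))    # 2.5
print(stage_objective("post_train", step_losses))  # 0.75
```

Keeping the stage definition in a table makes it explicit which objectives compete at each point in training, which is the trade-off the paper is concerned with.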
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The joint modelling of ASR and reconstruction is an interesting and well-motivated approach that could inspire future research bridging reconstruction and transcription tasks. 2. The dual-tower design (semantic and acoustic encoders with minimal parameter sharing) provides a clear architectural solution to mitigate the semantic–acoustic conflict. 3. The two-stage training of ASR/reconstruction pre-training followed by GAN-based post-training offers a sound recipe to balance intelligibility an
1. The claim that “XY-Tokenizer is the first speech codec to successfully model both semantic and acoustic information effectively at low bitrates” is overstated. Prior works such as SpeechTokenizer and Mimi already combined semantic and acoustic information, while XCodec models both simultaneously. The lower-bitrate argument is also weakly motivated, as established codecs (e.g., SpeechTokenizer) can easily operate at lower bitrates by reducing quantizer depth. 2. In Table 3, XY-Tokenizer does not cl
The work investigates an important and relevant topic: speech tokenizer training that emphasizes not only the content but also the quality of the reconstructed speech. The paper clearly outlines the goals of the work, explains how it differs from prior art, and compares performance against prior art using relevant objective metrics. The work addresses the weaknesses in existing tokenizer training approaches and elaborates on how it impacts either the quality or the content of t
The results presented in this work are convincing but rely mostly on objective measures. A subjective evaluation, such as a mean opinion score (MOS) test, would have made the results more convincing. Objective measures are often used to assess speech quality, but human perceptual measures are often more reliable when assessing the quality and semantic structure of generated speech.
1. The paper identifies a real problem: the semantic–acoustic tradeoff in speech codecs, which is relevant to Speech LLMs and audio foundation models. 2. The multi-stage training pipeline (ASR + reconstruction, then GAN fine-tuning) is systematically described and experimentally validated. 3. Evaluations cover semantic (ASR WER) and acoustic (SIM, STOI, PESQ) metrics. The results demonstrate improvements over baselines, particularly at low bitrate (1 kbps), suggesting good engineering effort
1. Limited novelty: ASR-based loss is already standard. The central methodological component, the ASR-based semantic loss, is not new. Recent audio tokenizers such as Baichuan-Audio and CosyVoice-3 already use similar strategies, integrating an ASR-guided objective to enhance semantic representation in codecs. Thus, the contribution of XY-Tokenizer is incremental. 2. Lack of justification for using ASR supervision. The paper does not validate the necessity of the ASR-based loss. Because the mo
1. XY-Codec achieves a superior balance among reconstruction quality, semantic awareness, frame rate and model size. 2. The observation that “decreasing the number of shared parameters between semantic and acoustic modeling mitigates semantic-acoustic conflict” is insightful. 3. The idea of separating semantic and acoustic channels and incorporating pretrained Whisper as the semantic encoder provides a novel solution to the semantic-acoustic conflict problem.
1. **Marginal Improvement**: The semantic ability and model size of XY-Tokenizer are on par with the baseline Baichuan method, while the improvement in reconstruction quality seems marginal. There is still a significant gap between the reconstruction quality of XY-Tokenizer and that of higher-bitrate baselines. 2. **Generalizability**: XY-Tokenizer is trained on Emilia, a speech-only dataset for TTS. Additionally, the acoustic encoder needs to be initialized with Whisper parameters. This means that XY-Tokeni
The paper addresses the timely problem in speech LLMs of efficiently tokenizing speech while preserving both semantic content and acoustic details. As speech LLMs gain prominence, solving this problem has important implications for practical deployment. The paper attempts to validate the approach across multiple evaluation dimensions including reconstruction metrics, text alignment, cross-lingual settings, and out-of-domain scenarios.
The introduction would benefit from stronger motivation and clearer framing of concepts. Core concepts like "semantic tokens" and "acoustic tokens" should be defined for a broader audience, and it would help to explicitly explain why speech LLMs benefit from modeling both semantic and acoustic information simultaneously. The authors should consider establishing the problem and its importance more clearly before introducing the technical contribution. In addition, the introduction mentions that s
Code & Models
- 🤗 bosonai/higgs-audio-v2-tokenizer · model · 1.9k dl · ♡ 46
- 🤗 MCplayer/XY_Tokenizer · model · 8 dl
- 🤗 PierrunoYT/higgs-audio-v2-tokenizer · model · 9 dl
- 🤗 gaoyang07/XY_Tokenizer · model · 1 dl
- 🤗 OpenMOSS-Team/XY_Tokenizer_TTSD_V0_hf · model · 181 dl · ♡ 1
- 🤗 OpenMOSS-Team/XY_Tokenizer_TTSD_V0_32k_hf · model
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
