Scaling Speech Tokenizers with Diffusion Autoencoders
Yuancheng Wang, Zhenyu Tang, Yun Wang, Arthur Hinsvark, Yingru Liu, Yinghao Li, Kainan Peng, Junyi Ao, Mingbo Ma, Mike Seltzer, Qing He, Xubo Liu

TL;DR
This paper introduces SiTok, a diffusion autoencoder-based speech tokenizer that effectively balances semantic understanding and audio reconstruction, achieving low bit and token rates with high fidelity on large-scale speech data.
Contribution
The paper presents a novel diffusion autoencoder approach for speech tokenization that scales to 1.6B parameters and outperforms existing methods in multiple tasks.
Findings
Outperforms baselines in understanding, reconstruction, and generation
Achieves low token rate of 12.5 Hz and bit-rate of 200 bits/sec
Scales to 1.6 billion parameters with 2 million hours of speech data
Abstract
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks, at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
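The reported figures of 12.5 Hz and 200 bits per second imply a fixed budget of 16 bits per token frame. The sketch below works through that arithmetic. The function names and the two example codebook layouts are hypothetical illustrations; this summary does not state which codebook configuration SiTok actually uses.

```python
import math


def bits_per_frame(bitrate_bps: float, token_rate_hz: float) -> float:
    """Bits available for each token frame at a given bit rate and token rate."""
    return bitrate_bps / token_rate_hz


def bitrate_bps(token_rate_hz: float, codebook_sizes: list[int]) -> float:
    """Bit rate of a quantizer that emits one index per codebook per frame."""
    bits = sum(math.log2(size) for size in codebook_sizes)
    return token_rate_hz * bits


# Figures taken from the summary: 200 bits/sec at 12.5 Hz.
budget = bits_per_frame(200.0, 12.5)

# Hypothetical layouts that both fit the same budget (assumptions, not from the paper):
single_codebook = bitrate_bps(12.5, [65536])   # one codebook with 2**16 entries
two_codebooks = bitrate_bps(12.5, [256, 256])  # two codebooks with 2**8 entries each

print(budget, single_codebook, two_codebooks)
```

Any layout whose per-frame index bits sum to 16 gives the same bit rate, so the bit rate alone does not determine the number or size of the codebooks.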
Peer Reviews
Decision: ICLR 2026 Poster
1. Performance: SiTok achieves superior performance in both speech reconstruction and understanding tasks at extremely low bit rates (0.2 kbps) and token rates (12.5 Hz), outperforming several strong baselines. This highlights its potential for efficient speech representation and modeling.
2. Innovative Design: The combination of diffusion models and semantic regularization is a novel approach in speech tokenization. These innovations enable SiTok to learn representations that are both acousticall
1. Lack of Direct TTS Task Evaluation: While the paper demonstrates strong performance on various speech understanding tasks such as ASR, ER, KS, and SV, it does not directly evaluate the proposed SiTok model on TTS tasks. This is a significant gap, as TTS tasks have unique requirements and challenges that may not be fully addressed by the current evaluation metrics.
2. Lack of Detail on Multi-Codebook CTC Decoder Implementation: The paper does not provide specific details on which layer the CTC
- The paper’s strength lies in its simple framework, which consists of a diffusion autoencoder, vector quantization, and CTC supervision. While not conceptually novel, it’s highly practical. The auxiliary CTC loss proves crucial (WER drops from 33.0 to 4.06, Table 3). Refinements like shortcut fine-tuning and classifier-free guidance further boost efficiency, reflecting thoughtful system design.
- The experiments are comprehensive. Table 5 explores eight design factors (e.g., loss type, codebook setti
- The paper prefers mel-spectrograms over raw waveforms without strong justification. Modern architectures can handle raw waveforms efficiently, and prior speech tokenizers (e.g., EnCodec, WavTokenizer, BigCodec, Mimi) operate directly on them successfully. Claims that waveforms are inefficient or adversarially complex are unconvincing, especially given that adversarial methods have been applied effectively in prior work. Mel-spectrograms are lossy and require a separate vocoder (Vocos), yet no
- The paper scales the speech tokenizer up to a 1.6B-parameter model trained on 2 million hours of speech data, effectively enhancing model performance and achieving state-of-the-art results.
- The authors conduct comprehensive ablation studies on various aspects of the tokenizer’s training process and offer insightful observations on effective training techniques.
- The current version appears to lack qualitative analyses or examples of actual reconstructed speech. Since audio reconstruction fidelity is critical for evaluating a speech tokenizer, it would be highly valuable to include qualitative results (e.g., audio samples) to better demonstrate the model’s effectiveness. **If such results are not presented in the rebuttal, I would consider decreasing my score.**
- The main improvements in this work seem to stem from scaling the model and training data
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis
