Sylber 2.0: A Universal Syllable Embedding
Cheol Jun Cho, Nicholas Lee, Alan W Black, and Gopala K. Anumanchipalli

TL;DR
Sylber 2.0 introduces a universal, efficient syllable-based speech coding framework that captures detailed acoustic and linguistic features across languages, enabling high-quality TTS and improved low-resource ASR performance.
Contribution
It presents Sylber 2.0, a self-supervised syllable embedding model that achieves low-frequency, high-fidelity speech representation across multiple languages and expressive styles.
Findings
Performs on par with high-frequency models in speech tasks.
Enables TTS with competitive quality using only 72M parameters.
Provides effective features for low-resource ASR.
Abstract
Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this gap, we present Sylber 2.0, a self-supervised framework for coding speech at the syllable level that enables efficient temporal compression and high-fidelity reconstruction. Sylber 2.0 achieves a very low token frequency around 5 Hz, while retaining both linguistic and acoustic detail across multiple languages and expressive styles. Experiments show that it performs on par with previous models operating on high-frequency baselines. Furthermore, Sylber 2.0 enables efficient TTS modeling which can generate speech with competitive intelligibility and quality with SOTA models using only 72M…
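As a rough illustration of what a token frequency around 5 Hz means for downstream sequence length, the sketch below converts a token rate into a token count for a fixed utterance duration. The 4.8 Hz average and the 12.5-86 Hz range for existing methods are the figures cited in the peer reviews on this page; the 60-second duration, the helper name `tokens_for`, and the dictionary labels are assumptions chosen only for illustration and are not taken from the paper.

```python
def tokens_for(duration_s: float, rate_hz: float) -> int:
    """Approximate number of tokens emitted for a clip of the given length."""
    return round(duration_s * rate_hz)


# Token rates in Hz, as cited in the reviews (labels are illustrative).
rates_hz = {
    "sylber2_avg": 4.8,
    "existing_low": 12.5,
    "existing_high": 86.0,
}

duration_s = 60.0  # assumed clip length, for illustration only

counts = {name: tokens_for(duration_s, hz) for name, hz in rates_hz.items()}

# How many times longer each sequence is than the 4.8 Hz one.
ratios = {name: n / counts["sylber2_avg"] for name, n in counts.items()}

for name in rates_hz:
    print(f"{name}: {counts[name]} tokens, {ratios[name]:.1f}x the 4.8 Hz length")
```

Under these assumptions, one minute of speech comes to 288 tokens at 4.8 Hz, against 750 to 5160 tokens at 12.5 to 86 Hz, which is the kind of reduction the reviews refer to when they mention lower computational cost for downstream modeling.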
Peer Reviews
Decision: Submitted to ICLR 2026
1. Achieves the lowest reported token frequency (4.8 Hz average) for multilingual speech, dramatically reducing computational costs for downstream modeling compared to existing methods (12.5-86 Hz).
2. Successfully learns syllabic structure across 102 languages without text supervision, addressing the major limitation of prior work (Sylber 1.0), which only handled English.
1. **Insufficient downstream task validation**: The paper proposes a new speech representation primarily motivated by speech language modeling, yet only a small portion (lines 462-476) demonstrates its usage. Only TTS results are shown, which is insufficient since TTS is relatively simple and can be trained effectively with Mel-Spectrogram + Vocoder without any tokenization. The paper needs to justify the benefits of the proposed embedding for more diverse downstream tasks and identify which tas
1) The proposed Sylber 2.0 achieves an impressive speech compression rate, with a token frequency of around 5 Hz on many languages, while retaining both linguistic and acoustic detail. Extensive experiments and analysis demonstrate the effectiveness of this method.
2) This paper provides a detailed implementation description and presents clear academic writing, data visualization, and result analysis.
1) The overall novelty is limited. This work extends the original Sylber system by adding an acoustic encoder and a vocoder, and by improving the training process of the content encoder. These changes are incremental.
2) This paper doesn't report quantitative ablation results for the proposed changes to the original Sylber.
- The goal of creating a universal, high-fidelity, and highly compressed speech token is a critical and high-impact challenge for the spoken language modeling community. Success here would enable models to process much longer speech contexts efficiently.
- The paper introduces several intelligent and specific methodological ideas. The central concept of a disentangled (d, C, A) token is elegant. The syllable-guided acoustic encoder is a specific, new architecture designed to solve the well-know
- The paper lacks empirical validation for its core contribution. The acoustic encoder ('A' token) is presented as the key innovation for achieving high-fidelity reconstruction. However, the paper provides no ablation studies to prove its impact. The main comparison in Table 2 is against the original "Sylber," which is not an apples-to-apples comparison. The gains are confounded by multiple variables: a new (multilingual) dataset, a new (and likely better) vocoder, and the removal of silent mask
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
