Sylber: Syllabic Embedding Representation of Speech from Raw Audio
Cheol Jun Cho, Nicholas Lee, Akshat Gupta, Dhruv Agarwal, Ethan Chen, Alan W Black, Gopala K. Anumanchipalli

TL;DR
Sylber introduces a self-supervised learning framework that produces structured syllabic speech representations, enabling efficient tokenization, robust segmentation, and effective speech compression suitable for language modeling.
Contribution
The paper presents a novel SSL method for syllabic speech embedding that achieves fast, robust segmentation and efficient tokenization without domain tuning.
Findings
Achieves syllabic tokenization at an average of 4.27 tokens per second
Enables speech reconstruction with lower bitrate than baseline SSL tokens
Categorical perception naturally emerges in Sylber embeddings
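To put the 4.27 tokens-per-second figure in context, the sketch below computes a rough upper bound on bitrate for a discrete token stream (token rate times bits per token). Only the 4.27 rate comes from the summary above. The 50 tokens-per-second frame-level rate and the codebook size of 1024 are assumptions chosen for illustration; they are not figures reported here.

```python
import math


def bitrate_bps(tokens_per_second: float, vocab_size: int) -> float:
    """Upper bound on bitrate when every token is coded with log2(vocab_size) bits."""
    return tokens_per_second * math.log2(vocab_size)


SYLLABIC_RATE = 4.27  # tokens/s, taken from the summary above
FRAME_RATE = 50.0     # tokens/s, ASSUMED frame-level baseline (illustration only)
VOCAB_SIZE = 1024     # HYPOTHETICAL codebook size (illustration only)

syllabic_bps = bitrate_bps(SYLLABIC_RATE, VOCAB_SIZE)
frame_bps = bitrate_bps(FRAME_RATE, VOCAB_SIZE)
reduction = frame_bps / syllabic_bps  # how many times fewer bits per second

print(f"syllabic: {syllabic_bps:.1f} bps, frame-level: {frame_bps:.1f} bps")
print(f"reduction factor: {reduction:.2f}x")
```

With equal codebook sizes the reduction factor is simply the ratio of token rates, so the comparison is driven by sequence length and does not depend on the codebook choice.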
Abstract
Syllables are compositional units of spoken language that efficiently structure human speech perception and production. However, current neural speech representations lack such structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised learning (SSL) framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling. Our proposed segmentation…
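As a rough illustration of what a linear-time segmentation over frame embeddings can look like, the sketch below makes a single left-to-right pass and groups consecutive frames whose cosine similarity to the running segment stays above a threshold. This is not the authors' algorithm: the function name `greedy_segment`, both thresholds, and the rule that low-norm frames count as non-speech are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def _norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    na, nb = _norm(a), _norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def greedy_segment(
    frames: Sequence[Sequence[float]],
    sim_threshold: float = 0.8,   # ASSUMED similarity cutoff
    norm_threshold: float = 0.1,  # ASSUMED cutoff for near-zero (non-speech) frames
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame spans found in one pass over the frames."""
    segments: List[Tuple[int, int]] = []
    start = None
    running_sum: List[float] = []
    for t, frame in enumerate(frames):
        is_speech = _norm(frame) >= norm_threshold
        if start is not None:
            # Cosine to the running sum equals cosine to the running mean.
            if is_speech and _cosine(frame, running_sum) >= sim_threshold:
                running_sum = [s + x for s, x in zip(running_sum, frame)]
                continue
            segments.append((start, t))  # close the open segment
            start = None
        if is_speech:
            start = t  # open a new segment at this frame
            running_sum = list(frame)
    if start is not None:
        segments.append((start, len(frames)))
    return segments


# Toy input: silence, two frames of one unit, two frames of another, silence.
toy_frames = [[0, 0], [1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0, 0]]
toy_segments = greedy_segment(toy_frames)
print(toy_segments)
```

Each frame is visited once and compared against a single running vector, so the cost grows linearly with the number of frames.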
Peer Reviews
Decision: ICLR 2025 Poster
The framework is well-motivated, particularly due to the efficiency of its tokenization algorithm, which helps manage the exponentially increasing compute costs associated with transformer-based models in downstream tasks.
While the approach is well-motivated as an efficient alternative for speech tokenization, achieving an average rate of 4.27 tokens per second, the evaluation metrics used don’t fully justify the applicability of these tokens for downstream tasks, as shown in Table 9. It would be beneficial for the authors to moderate some claims, such as:
• that syllabic units are better suited for lexical and syntactic understanding
• and that these units are better suited for SLU
Instead, the focus could
- The paper investigates an important and timely topic, namely speech tokenization focused on learning linguistically-motivated large granularity units.
- The proposed method for learning the units is simple and effective.
- The experiments are broad, covering categorical perception to syllable segmentation to resynthesis to spoken language understanding tasks with speech LMs.
- The experimental results are strong on all tasks evaluated.
The main weakness from my perspective is that I would have liked to see a more in-depth analysis of the resynthesis results in terms of naturalness. It is expected that when moving from low-level acoustic units to higher level syllable-like units, we may lose a lot of the low-level details that are unnecessary for higher level understanding but are needed to represent highly naturalistic speech. However, when building speech LMs we often want to re-synthesize their outputs so they may be played
1. The authors presented extensive experiments to demonstrate the strengths of Sylber, covering syllable segmentation, spoken language understanding, and audio codec.
2. The proposed learning approach is intuitive, and having a linear-time segmentation algorithm would also greatly facilitate the use of syllable-level features in downstream tasks such as spoken language modeling with syllable-level tokens.
3. The authors also presented interesting qualitative analysis in Section 4 connecting syllab
1. Some details and ablation studies are missing:
   1. The proposed method regresses non-speech frames to zero. What model / method was used to determine whether a frame is speech or non-speech?
   2. The authors claim that better features motivate the design of the linear-time greedy segmentation algorithm. For SD-HuBERT and Sylber, how much does that segmentation algorithm affect the performance? It would be clear if the authors can report Table 1 results with all combinations of (SSL featu
Taxonomy
Topics: Speech and Audio Processing
