Sylber: Syllabic Embedding Representation of Speech from Raw Audio

Cheol Jun Cho; Nicholas Lee; Akshat Gupta; Dhruv Agarwal; Ethan Chen; Alan W Black; Gopala K. Anumanchipalli

arXiv:2410.07168·cs.CL·March 4, 2025

TL;DR

Sylber introduces a self-supervised learning framework that produces structured syllabic speech representations, enabling efficient tokenization, robust segmentation, and effective speech compression suitable for language modeling.

Contribution

The paper presents a novel SSL method for syllabic speech embedding that achieves fast, robust segmentation and efficient tokenization without domain tuning.

Findings

01

Syllabic tokenization averages 4.27 tokens per second, backed by a fast, linear-time syllable segmentation algorithm

02

Enables speech reconstruction with lower bitrate than baseline SSL tokens

03

Categorical perception naturally emerges in Sylber embeddings
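To put the 4.27 tokens-per-second figure in context, the sketch below compares sequence lengths and a nominal bitrate against a frame-level tokenizer. Only the 4.27 rate comes from the paper. The 50 Hz frame rate (a 20 ms stride, typical of HuBERT-style encoders) and the vocabulary size of 1024 are assumptions chosen for illustration, and the bitrate formula is the simple upper bound of tokens per second times bits per token, not the paper's reported measurement.

```python
import math

SYLLABIC_RATE = 4.27   # tokens per second, as reported for Sylber
FRAME_RATE = 50.0      # assumption: 20 ms stride of a frame-level SSL encoder
VOCAB_SIZE = 1024      # hypothetical codebook size, for illustration only


def tokens_for(duration_s, tokens_per_s):
    """Expected number of tokens for an utterance of the given duration."""
    return duration_s * tokens_per_s


def bitrate_bps(tokens_per_s, vocab_size):
    """Upper-bound bitrate: each token costs log2(vocab_size) bits."""
    return tokens_per_s * math.log2(vocab_size)


duration = 10.0  # seconds of speech
syllabic_len = tokens_for(duration, SYLLABIC_RATE)
frame_len = tokens_for(duration, FRAME_RATE)

# Self-attention cost grows with the square of the sequence length,
# so a shorter sequence reduces that term by the squared length ratio.
length_ratio = frame_len / syllabic_len
attention_ratio = length_ratio ** 2

print(f"syllabic tokens: {syllabic_len:.1f}, frame tokens: {frame_len:.1f}")
print(f"length ratio: {length_ratio:.1f}x, attention ratio: {attention_ratio:.0f}x")
print(f"nominal bitrate: {bitrate_bps(SYLLABIC_RATE, VOCAB_SIZE):.1f} bps")
```

Frame-level tokenizers are often deduplicated before language modeling, which narrows the gap in practice, so the ratios here are best read as an upper bound under the stated assumptions.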

Abstract

Syllables are compositional units of spoken language that efficiently structure human speech perception and production. However, current neural speech representations lack such structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised learning (SSL) framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling. Our proposed segmentation…
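The abstract does not spell out the segmentation algorithm, so the following is only a minimal sketch of what a greedy, single-pass (linear-time) segmenter over frame embeddings could look like. It assumes that non-speech frames have near-zero norm and that frames within one syllable are highly similar to each other. The function name `greedy_segments`, both thresholds, and the use of cosine similarity between adjacent frames are illustrative choices, not the authors' implementation.

```python
import math


def norm(v):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def greedy_segments(frames, norm_threshold=0.1, sim_threshold=0.8):
    """Return half-open (start, end) frame ranges in one left-to-right pass.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    segments = []
    start = None  # index where the current segment began, if any
    for i, frame in enumerate(frames):
        if norm(frame) <= norm_threshold:
            # Treated as non-speech: close any open segment.
            if start is not None:
                segments.append((start, i))
                start = None
            continue
        if start is None:
            start = i  # first speech frame opens a new segment
        elif cosine(frames[i - 1], frame) < sim_threshold:
            # Similarity to the previous frame dropped: place a boundary.
            segments.append((start, i))
            start = i
    if start is not None:
        segments.append((start, len(frames)))
    return segments


# Toy input: silence, two frames of one unit, two of another, silence.
toy = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 1.0], [0.0, 0.0]]
print(greedy_segments(toy))
```

Each frame is visited once and compared only with its predecessor, so the running time is linear in the number of frames. Averaging the frames inside each returned range would give one embedding per segment.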

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The framework is well-motivated, particularly due to the efficiency of its tokenization algorithm, which helps manage the exponentially increasing compute costs associated with transformer-based models in downstream tasks.

Weaknesses

While the approach is well-motivated as an efficient alternative for speech tokenization, achieving an average rate of 4.27 tokens per second, the evaluation metrics used don’t fully justify the applicability of these tokens for downstream tasks, as shown in Table 9. It would be beneficial for the authors to moderate some claims, such as:
• that syllabic units are better suited for lexical and syntactic understanding
• and that these units are better suited for SLU
Instead, the focus could

Reviewer 02 · Rating 8 · Confidence 5

Strengths

- The paper investigates an important and timely topic, namely speech tokenization focused on learning linguistically-motivated large granularity units.
- The proposed method for learning the units is simple and effective.
- The experiments are broad, covering categorical perception to syllable segmentation to resynthesis to spoken language understanding tasks with speech LMs.
- The experimental results are strong on all tasks evaluated.

Weaknesses

The main weakness from my perspective is that I would have liked to see a more in-depth analysis of the resynthesis results in terms of naturalness. It is expected that when moving from low-level acoustic units to higher level syllable-like units, we may lose a lot of the low-level details that are unnecessary for higher level understanding but are needed to represent highly naturalistic speech. However, when building speech LMs we often want to re-synthesize their outputs so they may be played

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. The authors presented extensive experiments to demonstrate the strengths of Sylber, covering syllable segmentation, spoken language understanding, and audio codec.
2. The proposed learning approach is intuitive, and having a linear-time segmentation algorithm would also greatly facilitate the use of syllable-level features in downstream tasks such as spoken language modeling with syllable-level tokens.
3. Authors also presented interesting qualitative analysis in Section 4 connecting syllab

Weaknesses

1. Some details and ablation studies are missing:
   1. The proposed method regresses non-speech frames to zero. What model / method was used to determine whether a frame is speech or non-speech?
   2. The authors claim that better features motivate the design of the linear-time greedy segmentation algorithm. For SD-HuBERT and Sylber, how much does that segmentation algorithm affect the performance? It would be clear if the authors can report Table 1 results with all combinations of (SSL featu

Code & Models

Repositories

Berkeley-Speech-Group/sylber
pytorch · Official

Models

🤗
cheoljun95/sylber
model· ♡ 5

Taxonomy

Topics · Speech and Audio Processing