ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling
Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

TL;DR
ZeroSyl introduces a simple, training-free method to extract syllable-like units from raw speech using a frozen WavLM model, enabling effective spoken language modeling without complex pipelines.
Contribution
The paper proposes ZeroSyl, an approach that derives syllable boundaries and embeddings directly from a frozen, pre-trained WavLM model with no additional training, simplifying syllable tokenization for spoken language models.
Findings
ZeroSyl achieves competitive syllable segmentation performance.
ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks.
In scaling experiments, finer-grained units benefit lexical tasks, while syllabic units improve syntactic modeling.
Abstract
Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical…
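The pipeline described in the abstract (frame-level L2 norms, segmentation, mean pooling, K-means) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how boundaries are read off the norm curve, so placing them at local minima is an assumption made here, and the function names (`frame_norms`, `find_boundaries`, `mean_pool`, `kmeans`) are hypothetical. The input is assumed to be a `(T, D)` array of frame features already extracted from an intermediate WavLM layer; feature extraction itself is not shown.

```python
import numpy as np


def frame_norms(features):
    """L2 norm of each frame; `features` has shape (T, D)."""
    return np.linalg.norm(features, axis=1)


def find_boundaries(norms):
    """Return segment edges, including 0 and T.

    ASSUMPTION: boundaries sit at interior local minima of the norm curve.
    The source only says that L2 norms are used for segmentation.
    """
    inner = norms[1:-1]
    is_min = (inner < norms[:-2]) & (inner <= norms[2:])
    minima = np.flatnonzero(is_min) + 1  # shift back to original frame indices
    return np.concatenate(([0], minima, [len(norms)]))


def mean_pool(features, boundaries):
    """Average the frames inside each [start, end) segment."""
    pairs = zip(boundaries[:-1], boundaries[1:])
    return np.stack([features[s:e].mean(axis=0) for s, e in pairs])


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; requires k <= len(x)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]

    def assign(c):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centroids)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, assign(centroids)


def tokenize(features, k):
    """Frame features (T, D) -> one discrete unit ID per segment."""
    boundaries = find_boundaries(frame_norms(features))
    pooled = mean_pool(features, boundaries)
    _, units = kmeans(pooled, k)
    return boundaries, units
```

In practice the codebook would be fit once on pooled segments from a whole corpus and then reused to label new utterances; `tokenize` fits it per call only to keep the sketch short. The resulting unit sequence has one token per segment rather than one per frame, which is what shortens the input to the language model.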
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research
