SyllableLM: Learning Coarse Semantic Units for Speech Language Models
Alan Baade, Puyuan Peng, David Harwath

TL;DR
This paper introduces SyllableLM, a speech language model built on syllable-like semantic units whose coarseness can be controlled, improving both efficiency and performance on spoken language tasks.
Contribution
The paper presents a novel self-supervised method for extracting syllable-like units from speech, enabling more efficient and semantically meaningful tokenization for speech language models.
Findings
Achieves state-of-the-art syllabic segmentation and clustering.
Reduces training compute by 30x and speeds up inference by 4x.
Matches or outperforms existing speech language models on various tasks.
Abstract
Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our…
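The token-count argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the 16,000 samples/second figure comes from the abstract, while the 50 Hz frame-level unit rate, the 5 Hz coarse unit rate, and the speaking rate of 2.5 words/second are assumptions chosen for the example, not values reported in this summary.

```python
def tokens_per_word(unit_rate_hz: float, words_per_second: float) -> float:
    """Average number of discrete units emitted per spoken word."""
    return unit_rate_hz / words_per_second


SAMPLE_RATE_HZ = 16_000      # waveform resolution quoted in the abstract
FRAME_UNIT_RATE_HZ = 50.0    # assumption: one unit per 20 ms encoder frame
COARSE_UNIT_RATE_HZ = 5.0    # assumption: illustrative syllable-like unit rate
WORDS_PER_SECOND = 2.5       # assumption: roughly 150 words per minute

# Raw samples summarized by a single frame-level unit.
samples_per_frame_unit = SAMPLE_RATE_HZ / FRAME_UNIT_RATE_HZ

# Tokens a language model must process per word at each granularity.
frame_tokens_per_word = tokens_per_word(FRAME_UNIT_RATE_HZ, WORDS_PER_SECOND)
coarse_tokens_per_word = tokens_per_word(COARSE_UNIT_RATE_HZ, WORDS_PER_SECOND)

print(f"samples per frame-level unit: {samples_per_frame_unit:.0f}")
print(f"frame-level tokens per word:  {frame_tokens_per_word:.1f}")
print(f"coarse tokens per word:       {coarse_tokens_per_word:.1f}")
```

Under these assumed rates, coarser units shorten the sequence a language model has to process by a factor of ten, which is the kind of saving the abstract's motivation points to.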
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
Methods: ALIGN
