SyllableLM: Learning Coarse Semantic Units for Speech Language Models

Alan Baade; Puyuan Peng; David Harwath

arXiv:2410.04029·cs.CL·October 8, 2024

TL;DR

This paper introduces SyllableLM, a speech language model built on coarse, syllable-like semantic units whose granularity can be controlled, improving both efficiency and performance on spoken language tasks.

Contribution

The paper presents a novel self-supervised method for extracting syllable-like units from speech, enabling more efficient and semantically meaningful tokenization for speech language models.

Findings

01. Achieves state-of-the-art syllabic segmentation and clustering.

02. Reduces training compute by a factor of 30 and inference time by a factor of 4.

03. Matches or outperforms existing speech language models on a range of spoken language tasks.

Abstract

Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our…
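To make the idea of merging frame-level representations into coarser units concrete, here is a minimal sketch. It assumes the segment boundaries are already known and simply mean-pools the frames inside each segment. The function name `merge_frames`, the boundary format, and the use of mean pooling are assumptions for illustration only; this is not the paper's boundary-extraction or distillation procedure.

```python
from typing import List, Sequence


def merge_frames(
    frames: Sequence[Sequence[float]], boundaries: Sequence[int]
) -> List[List[float]]:
    """Mean-pool consecutive frame vectors into one vector per segment.

    `boundaries` lists the start index of each segment (a hypothetical
    input format); it must begin at 0 and be strictly increasing.
    """
    if not frames:
        return []
    if not boundaries or boundaries[0] != 0:
        raise ValueError("boundaries must start at 0")
    # Append the total frame count so each segment has a start and an end.
    edges = list(boundaries) + [len(frames)]
    units = []
    for start, end in zip(edges, edges[1:]):
        if not 0 <= start < end <= len(frames):
            raise ValueError("boundaries must be strictly increasing and in range")
        segment = frames[start:end]
        dim = len(segment[0])
        # Average each feature dimension over the frames in this segment.
        units.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return units


# Five toy 2-dimensional frames grouped into two coarser units.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
units = merge_frames(frames, boundaries=[0, 2])
print(units)  # [[2.0, 0.0], [0.0, 4.0]]
```

In this toy case five frames collapse into two units, so a downstream language model would see fewer tokens for the same stretch of audio.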

Code & Models

Repositories

alanbaade/SyllableLM (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques

Methods: ALIGN