Spoken Language Modeling with Duration-Penalized Self-Supervised Units

Nicol Visser; Herman Kamper

arXiv:2505.23494 · cs.CL · May 30, 2025

TL;DR

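This paper examines how codebook size and unit coarseness (i.e., unit duration) affect spoken language model performance. Using the DPDP method to control coarseness, it finds that coarser units help on sentence-level resynthesis and on lexical and syntactic language modeling, but give little benefit at the phone and word levels.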

Contribution

It uses the simple duration-penalized dynamic programming (DPDP) method to vary unit coarseness together with codebook size, and analyzes the effect on spoken language models at several linguistic levels (phone, word, and sentence).
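
To make the idea of a duration penalty concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (the linked repository is the place to look for that). It assumes one common formulation: each frame is assigned a codebook entry so as to minimize the total squared distance to the chosen entries plus a fixed penalty every time the unit changes. The names `dpdp_units`, `collapse_repeats`, and `lmbda` are hypothetical, and the paper's exact objective may differ.

```python
from typing import List, Sequence


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def dpdp_units(frames: Sequence[Sequence[float]],
               codebook: Sequence[Sequence[float]],
               lmbda: float) -> List[int]:
    """Assign one codebook index per frame with a Viterbi-style search.

    Assumed objective (an illustration, not taken from the paper):
        sum_t ||x_t - c_{z_t}||^2  +  lmbda * (number of unit changes)
    A larger lmbda discourages changes, so units become longer (coarser).
    """
    n, k = len(frames), len(codebook)
    if n == 0:
        return []
    cost = [_sq_dist(frames[0], c) for c in codebook]
    back: List[List[int]] = []
    for t in range(1, n):
        # Cheapest unit at the previous frame; any switch starts from it.
        best_prev = min(range(k), key=cost.__getitem__)
        new_cost, ptr = [], []
        for j in range(k):
            stay, switch = cost[j], cost[best_prev] + lmbda
            prev, base = (j, stay) if stay <= switch else (best_prev, switch)
            new_cost.append(base + _sq_dist(frames[t], codebook[j]))
            ptr.append(prev)
        cost = new_cost
        back.append(ptr)
    # Trace back from the cheapest final state.
    z = min(range(k), key=cost.__getitem__)
    path = [z]
    for ptr in reversed(back):
        z = ptr[z]
        path.append(z)
    path.reverse()
    return path


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical frame labels into single units."""
    out: List[int] = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out
```

With `lmbda = 0` this sketch reduces to nearest-neighbour quantization of every frame. Raising `lmbda` smooths over brief excursions to other codebook entries, so the collapsed sequence contains fewer, longer units.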

Findings

01

Coarser units improve sentence resynthesis performance.

02

In lexical and syntactic language modeling tasks, coarser units give higher accuracy at lower bitrates.

03

At the phone and word levels, coarseness provides little benefit as long as the codebook size is chosen appropriately.
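
The findings above refer to bitrate without defining it. As a rough sketch, one widely used way to estimate the bitrate of a discrete unit sequence is to multiply the number of units per second by the empirical entropy of the unit distribution. Whether the paper computes bitrate exactly this way is an assumption, and the function name `unit_bitrate` is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def unit_bitrate(units: Sequence[Hashable], duration_s: float) -> float:
    """Estimate bits per second for a sequence of discrete units.

    Assumed definition (illustrative only):
        bitrate = (N / duration) * H
    where N is the number of units and H is the empirical entropy,
    in bits, of the unit distribution in this sequence.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    n = len(units)
    if n == 0:
        return 0.0
    counts = Counter(units)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy / duration_s
```

Under this definition, coarser units lower the bitrate mainly by reducing the number of units per second, while a larger codebook tends to raise it through higher entropy per unit.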

Abstract

Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but…

Code & Models

Repositories

nicolvisser/dp-slm
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling