Spoken Language Modeling with Duration-Penalized Self-Supervised Units
Nicol Visser, Herman Kamper

TL;DR
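This paper examines how codebook size and unit coarseness (i.e., duration) jointly affect spoken language model performance, using the DPDP method to vary coarseness. Coarser units help in sentence-level resynthesis and give higher accuracies at lower bitrates in lexical and syntactic language modeling, but offer little benefit at the phone and word levels.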
Contribution
It varies codebook size and unit coarseness together using the duration-penalized dynamic programming (DPDP) method and analyzes the effect across different linguistic levels, from phones and words to whole sentences.
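To make the idea of duration-penalized segmentation concrete, the following is a minimal sketch and not the authors' implementation. It assumes squared Euclidean distance between frames and codebook entries, and a linear penalty of -(duration - 1) scaled by a hypothetical weight `lam`, so that larger `lam` favors longer (coarser) units. The function name `dpdp_segment` and the toy data are illustrative assumptions.

```python
import numpy as np

def dpdp_segment(features, codebook, lam):
    """Segment frames into variable-length units (illustrative sketch).

    features: (T, D) array of frame-level representations.
    codebook: (K, D) array of code vectors.
    lam: assumed penalty weight; larger values give longer segments.
    Returns a list of (start, end, code) with end exclusive.
    """
    T = features.shape[0]
    K = codebook.shape[0]
    # dists[t, k] = squared distance from frame t to code k.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Prefix sums: a segment's cost per code is cum[end] - cum[start].
    cum = np.vstack([np.zeros((1, K)), np.cumsum(dists, axis=0)])

    best = np.full(T + 1, np.inf)   # best[t]: lowest cost of frames [0, t)
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)      # start of the last segment
    code_at = np.zeros(T + 1, dtype=int)   # code of the last segment

    for end in range(1, T + 1):
        for start in range(end):
            seg_costs = cum[end] - cum[start]      # shape (K,)
            k = int(np.argmin(seg_costs))
            duration = end - start
            # Assumed penalty: reward each extra frame in a segment.
            cost = best[start] + seg_costs[k] - lam * (duration - 1)
            if cost < best[end]:
                best[end] = cost
                back[end] = start
                code_at[end] = k

    # Backtrack from the final frame to recover the segmentation.
    segments = []
    end = T
    while end > 0:
        start = int(back[end])
        segments.append((start, end, int(code_at[end])))
        end = start
    segments.reverse()
    return segments

# Toy usage: six 1-D frames and a two-entry codebook.
frames = np.array([[0.0], [0.1], [0.0], [1.0], [0.9], [1.0]])
codes = np.array([[0.0], [1.0]])
print(dpdp_segment(frames, codes, lam=0.1))
print(dpdp_segment(frames, codes, lam=10.0))
```

With the small penalty weight the toy input splits into two units, while the large weight merges everything into a single unit, which illustrates how one parameter trades quantization error against coarseness. This sketch runs in O(T²K) time and is meant for reading, not for long utterances.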
Findings
Coarser units improve sentence resynthesis performance.
Coarser units yield higher accuracies at lower bitrates in lexical and syntactic language modeling tasks.
At the phone and word levels, coarseness provides little benefit as long as the codebook size is chosen appropriately.
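The findings above compare accuracy against bitrate. As a rough illustration of why coarser units lower the bitrate, the sketch below assumes one common definition: the number of units per second multiplied by the empirical entropy of the unit distribution. This definition, the function name `unit_bitrate`, and the toy sequences are assumptions for illustration and may differ from the paper's exact measure.

```python
import math
from collections import Counter

def unit_bitrate(units, total_seconds):
    """Estimate bits per second of a discrete unit sequence.

    Assumed definition: (number of units) * (entropy in bits) / duration.
    """
    n = len(units)
    counts = Counter(units)
    # Empirical entropy of the unit distribution, in bits per unit.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy / total_seconds

# Toy comparison over the same 2 seconds of speech:
fine = [0, 1, 0, 1]     # four short units
coarse = [0, 1]         # two longer units
print(unit_bitrate(fine, 2.0))
print(unit_bitrate(coarse, 2.0))
```

Under this definition, both codebook size (through the entropy term) and coarseness (through the unit count) change the bitrate, which is why the two factors are worth varying together.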
Abstract
Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
