Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models
Shunsuke Kando, Yusuke Miyao, Shinnosuke Takamichi

TL;DR
This study examines how segmentation width and vocabulary (cluster) size in speech tokenization affect speech language model performance. It finds that moderately coarse segmentation and larger cluster sizes improve zero-shot spoken language understanding, and that coarser segmentation also makes training more efficient.
Contribution
It provides a systematic analysis of segmentation and vocabulary choices, demonstrating their effects on model efficiency and performance in zero-shot spoken language understanding.
Findings
Moderately coarse segmentation improves model performance.
Larger cluster sizes enhance discrete unit quality.
Among the best-performing models, the most efficient one uses 50% less training data and 70% less training runtime.
Abstract
The purpose of speech tokenization is to transform a speech signal into a sequence of discrete representations, serving as the foundation for speech language models (SLMs). While speech tokenization has many options, their effect on the performance of SLMs remains unclear. This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units. First, we segment speech signals into fixed/variable widths and pooled representations. We then train K-means models in multiple cluster sizes. Through the evaluation on zero-shot spoken language understanding benchmarks, we find the positive effect of moderately coarse segmentation and bigger cluster size. Notably, among the best-performing models, the most efficient one achieves a 50% reduction in training data and a 70% decrease in training runtime. Our analysis highlights the importance of…
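As a rough illustration of the two knobs the abstract describes, the sketch below mean-pools frame-level features over a fixed segmentation width and then maps each pooled vector to its nearest K-means centroid. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names (`mean_pool`, `kmeans`, `tokenize`), the toy 2-D features, and the deterministic first-k initialization are all hypothetical, and variable-width segmentation is not covered.

```python
from typing import List

Vector = List[float]


def mean_pool(frames: List[Vector], width: int) -> List[Vector]:
    """Average consecutive groups of `width` frames; the last group may be shorter."""
    pooled = []
    for start in range(0, len(frames), width):
        chunk = frames[start:start + width]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec: Vector, centroids: List[Vector]) -> int:
    """Index of the closest centroid (lowest index wins ties)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def kmeans(points: List[Vector], k: int, iters: int = 20) -> List[Vector]:
    """Plain Lloyd's algorithm. Assumption: centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def tokenize(frames: List[Vector], width: int, centroids: List[Vector]) -> List[int]:
    """Pool frames at the given width, then emit one discrete unit per segment."""
    return [nearest(v, centroids) for v in mean_pool(frames, width)]


# Toy features standing in for encoder outputs (hypothetical data).
frames = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0],
          [0.0, 0.2], [0.1, 0.1], [5.0, 5.2], [5.1, 5.1]]
centroids = kmeans(mean_pool(frames, 2), k=2)
units = tokenize(frames, 2, centroids)
```

In this sketch, doubling `width` halves the number of units per utterance, which is the mechanism by which coarser segmentation shortens training sequences, while `k` sets the vocabulary size the speech language model must learn.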
