A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
Dingdong Wang, Mingyu Cui, Dongchao Yang, Xueyuan Chen, and Helen Meng

TL;DR
This study compares discrete and continuous speech tokens in large language models, finding that continuous features generally outperform discrete ones on semantic tasks, and identifies key limitations of discrete tokens.
Contribution
The paper provides a comprehensive comparison between discrete and continuous speech features in LLMs and analyzes the reasons discrete tokens underperform.
Findings
Continuous features outperform discrete tokens in semantic tasks.
Discrete tokens have limitations like limited granularity and inefficient information retention.
Analysis offers insights for improving discrete speech tokens.
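The granularity limitation can be illustrated with a toy sketch. This is not the paper's tokenizer: the summary does not say how the discrete tokens were extracted, so nearest-neighbour vector quantization, the random Gaussian "frames", and the helper names `quantize` and `reconstruction_error` are all assumptions made for illustration. The sketch only shows that mapping continuous frames onto a small codebook discards information, and that a larger codebook discards less.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each continuous frame to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every codebook entry.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def reconstruction_error(frames, codebook):
    """Mean squared distance between each frame and its assigned codebook vector."""
    tokens = quantize(frames, codebook)
    return float(((frames - codebook[tokens]) ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
# Hypothetical stand-in for encoder features: 200 frames of dimension 16.
frames = rng.normal(size=(200, 16))

# Nested codebooks (the first k frames), so a larger codebook can never do worse.
errors = {k: reconstruction_error(frames, frames[:k]) for k in (4, 32, 200)}
```

Because the codebooks are nested, the error is non-increasing as the codebook grows, and it drops to zero only when every frame has its own entry. At that point the tokens no longer compress anything.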
Abstract
With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate with text-based tokens seamlessly. Compared to most studies that focus on continuous speech features, although discrete-token based LLMs have shown promising results on certain tasks, the performance gap between these two paradigms is rarely explored. In this paper, we present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a light-weight LLM (Qwen1.5-0.5B). Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding. Moreover, this study goes beyond surface-level comparison by identifying key factors behind the under-performance of discrete tokens, such as limited token…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
