DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models

Yuanyuan Wang; Dongchao Yang; Yiwen Shao; Hangting Chen; Jiankun Zhao; Zhiyong Wu; Helen Meng; Xixin Wu

arXiv:2508.08961 · cs.SD · November 18, 2025

TL;DR

DualSpeechLM introduces a unified speech understanding and generation model using dual speech token modeling, leveraging a novel tokenizer and training strategies to bridge modality gaps and enhance performance.

Contribution

The paper proposes USTokenizer for high-level semantic speech representation and a dual-token modeling framework, enabling effective joint speech understanding and generation in one model.

Findings

01. Effective speech understanding and generation achieved
02. Reduced modality gap between speech and text tokens
03. Enhanced training stability and performance

Abstract

Extending pre-trained text Large Language Models (LLMs) with speech understanding or generation abilities by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces two challenges: (1) because of the large modality gap between speech and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) generation and understanding tasks prefer information at different levels: generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence makes it difficult to optimize performance in one unified model. To address these challenges, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques