Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

Haibin Wu; Yuxuan Hu; Ruchao Fan; Xiaofei Wang; Kenichi Kumatani; Bo Ren; Jianwei Yu; Heng Lu; Lijuan Wang; Yao Qian; Jinyu Li

arXiv:2506.04518 · eess.AS · February 12, 2026

TL;DR

This paper compares different speech-text decoding strategies in speech language models, introduces a faster interleaved decoding method with improved performance, and enhances speech question answering through curated datasets.

Contribution

It systematically evaluates joint decoding paradigms, proposes a novel early-stop interleaved method for faster decoding, and improves speech QA with curated datasets.

Findings

1. Interleaved decoding achieves the best alignment.
2. The early-stop interleaved pattern significantly speeds up decoding.
3. Curated QA datasets improve speech QA performance.

Abstract

Speech language models (Speech LMs) enable end-to-end speech-text modeling within a single model, offering a promising direction for spoken dialogue systems. The choice of joint speech-text decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer, and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference because of its long token sequences. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate…
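To make the sequence-length argument concrete, here is a minimal sketch of how an interleaved token stream might be laid out and how stopping the text slots early could shorten it. The details are assumptions for illustration only and are not taken from the abstract: the chunk sizes `n_text` and `n_speech`, the `<pad>` and `<eot>` markers, and the idea that a plain interleaved stream keeps filling text slots with padding after the text has ended are all hypothetical.

```python
from typing import List

PAD = "<pad>"  # hypothetical filler for an empty text slot
EOT = "<eot>"  # hypothetical end-of-text marker


def interleave(text: List[str], speech: List[str],
               n_text: int = 2, n_speech: int = 4,
               early_stop: bool = False) -> List[str]:
    """Alternate chunks of text tokens and speech tokens.

    Assumed behavior (not confirmed by the source):
    - early_stop=False: every round has a text slot; once the text is
      exhausted the slot is filled with PAD.
    - early_stop=True: when the text runs out, emit EOT once and drop
      the text slot from all later rounds.
    """
    out: List[str] = []
    t = s = 0
    text_done = False
    while s < len(speech):
        if not (early_stop and text_done):
            chunk = text[t:t + n_text]
            t += len(chunk)
            if early_stop and t >= len(text):
                out.extend(chunk)
                out.append(EOT)
                text_done = True
            else:
                # Pad the text slot to a fixed width.
                out.extend(chunk + [PAD] * (n_text - len(chunk)))
        out.extend(speech[s:s + n_speech])
        s += n_speech
    return out


text = ["t0", "t1", "t2"]
speech = [f"s{i}" for i in range(16)]
plain = interleave(text, speech)
esi = interleave(text, speech, early_stop=True)
print(len(plain), len(esi))
```

Under these assumptions the early-stop stream is shorter because speech usually needs many more tokens than the text it voices, so most rounds in the plain layout would carry only padding in their text slot. Fewer tokens per response means fewer autoregressive decoding steps.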

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Balanced Selection