DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

Yuxuan Lou; Ziming Wu; Yaochen Wang; Yong Liu; Yingxuan Ren; Fuming Lai; Shaobing Lian; Jie Tang; Yang You

arXiv:2601.22889 · cs.CL · February 2, 2026

TL;DR

This paper introduces DiffuSpeech, a diffusion-based speech-text model that generates reasoning traces and spoken responses simultaneously, improving speech QA accuracy and TTS quality by unifying understanding and generation.

Contribution

It presents the first diffusion-based speech-text model supporting joint reasoning and speech generation, along with a new speech QA dataset with reasoning traces.

Findings

01 Achieves state-of-the-art speech-to-speech QA accuracy, outperforming baselines by up to 9 points.

02 Attains the best TTS quality among generative models, with a 6.2% WER.

03 Confirms that both the diffusion architecture and the reasoning traces improve performance.

Abstract

Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer", a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show…
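As a rough illustration of what a modality-specific masking schedule could look like in a masked diffusion setup, the sketch below corrupts a mixed sequence of text and speech tokens at a noise level `t` between 0 and 1. The abstract does not state the actual schedules, so the linear text schedule, the quadratic speech schedule, and the names `mask_prob`, `corrupt`, and `MASK` are all assumptions made for illustration, not the paper's method.

```python
import random

MASK = "<mask>"  # hypothetical mask token shared by both modalities


def mask_prob(t, modality):
    """Masking probability at noise level t in [0, 1].

    Assumed schedules (not taken from the paper): linear for text,
    quadratic for speech, so speech is masked less at intermediate t.
    """
    if modality == "text":
        return t
    return t ** 2


def corrupt(tokens, modalities, t, rng):
    """Independently mask each token using its own modality's schedule."""
    out = []
    for tok, mod in zip(tokens, modalities):
        out.append(MASK if rng.random() < mask_prob(t, mod) else tok)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["why", "because", "s12", "s7"]  # two text, two speech tokens
    mods = ["text", "text", "speech", "speech"]
    print(corrupt(tokens, mods, 0.5, rng))
```

At `t = 0` the sequence is left untouched and at `t = 1` every position is masked; a denoising model would be trained to recover the masked positions, and generation would run this process in reverse over several steps.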

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Generative Adversarial Networks and Image Synthesis