DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
Yuxuan Lou, Ziming Wu, Yaochen Wang, Yong Liu, Yingxuan Ren, Fuming Lai, Shaobing Lian, Jie Tang, Yang You

TL;DR
This paper introduces DiffuSpeech, a diffusion-based speech-text model that generates reasoning traces and spoken responses simultaneously, improving speech QA accuracy and TTS quality by unifying understanding and generation.
Contribution
It presents the first diffusion-based speech-text model supporting joint reasoning and speech generation, along with a new speech QA dataset with reasoning traces.
Findings
Achieves state-of-the-art speech-to-speech QA accuracy, outperforming baselines by up to 9 points.
Attains the best TTS quality among generative models with 6.2% WER.
Confirms that diffusion architecture and reasoning traces enhance performance.
Abstract
Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer", a paradigm in which speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct \dataset{}, the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show…
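To make the masked-diffusion idea in the abstract concrete, here is a minimal toy sketch of masking with a per-modality schedule and iterative unmasking. It is not the paper's implementation: the `MASK` id, the linear and square-root schedules in `mask_rate`, the confidence-ordered reveal rule in `denoise`, and the `predict` callback standing in for a trained model are all assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical mask token id (assumption, not from the paper)


def mask_rate(t, modality):
    """Probability of masking a token at diffusion time t in [0, 1].

    Assumed schedules: text is masked linearly in t, speech follows sqrt(t),
    so speech is masked more heavily at intermediate times.
    """
    return t if modality == "text" else t ** 0.5


def forward_mask(tokens, modalities, t, rng):
    """Independently replace each token with MASK at its modality's rate."""
    return [
        MASK if rng.random() < mask_rate(t, m) else tok
        for tok, m in zip(tokens, modalities)
    ]


def denoise(tokens, predict, steps):
    """Iteratively unmask a sequence over a fixed number of steps.

    predict(tokens, i) is a stand-in for a model; it must return a
    (token, confidence) pair for masked position i. Each step reveals the
    most confident share of the remaining masked positions, and the final
    step reveals whatever is left.
    """
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        k = max(1, len(masked) // (steps - step))
        # Score every masked position against the same snapshot of the sequence.
        guesses = {i: predict(tokens, i) for i in masked}
        ranked = sorted(masked, key=lambda i: -guesses[i][1])
        for i in ranked[:k]:
            tokens[i] = guesses[i][0]
    return tokens
```

In this sketch, text and speech tokens share one sequence and one denoising loop, and only the masking rate differs by modality. A real system would replace `predict` with a trained denoiser and would likely use different schedules.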
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Generative Adversarial Networks and Image Synthesis
