Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation
Xuan Du, Qiangyu Yan, Wenshuo Li, Borui Jiang, Changming Xiao, Han Shu, Xinghao Chen

TL;DR
This paper introduces InterRS, a novel method for real-time speech generation that interleaves reasoning steps with speech, improving fluency and reasoning accuracy in AI communication.
Contribution
The paper presents a new pipeline for generating interleaved reasoning and speech data, along with training techniques that enhance naturalness and reasoning performance in speech generation.
Findings
Achieves 13% better performance on mathematical and logic benchmarks.
Generates instant, fluent responses comparable to spoken-language models.
Produces more natural and fluent answers than prior methods.
Abstract
The thinking-while-speaking paradigm aims to make AI communication more human-like. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data in which reasoning and speech are precisely aligned and their length ratio is controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT on refined data with reinforcement learning using two new rewards: a TA-Balance Reward to manage timing and the thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show our approach achieves 13% better performance on mathematical and logic benchmarks while generating instant responses, like a spoken-language instruct model which outputs fast CoT…
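The abstract names a TA-Balance Reward but does not give its formula here. The following is only a toy sketch of what a reward over timing and thinking-answer ratio might look like. The segment representation, the function name `ta_balance_reward`, the parameters `target_ratio` and `max_leading_think`, and the equal weighting are all assumptions made for illustration, not the authors' definition.

```python
from typing import List, Tuple

# Hypothetical representation (not the paper's data format): an interleaved
# output is a list of (kind, n_tokens) segments, kind is "think" or "speak".
Segment = Tuple[str, int]


def ta_balance_reward(
    segments: List[Segment],
    target_ratio: float = 1.0,
    max_leading_think: int = 0,
) -> float:
    """Toy reward in [0, 1] combining a ratio term and a timing term."""
    think = sum(n for kind, n in segments if kind == "think")
    speak = sum(n for kind, n in segments if kind == "speak")
    if speak == 0:
        # No spoken answer at all: nothing to balance against.
        return 0.0

    # Ratio term: 1.0 when think/speak equals the target, falling
    # linearly to 0.0 as the relative deviation reaches 100%.
    ratio = think / speak
    deviation = abs(ratio - target_ratio) / max(target_ratio, 1e-9)
    ratio_term = max(0.0, 1.0 - deviation)

    # Timing term: count thinking tokens emitted before the first speech
    # segment, since silent thinking up front delays the response.
    leading = 0
    for kind, n in segments:
        if kind == "speak":
            break
        leading += n
    timing_term = 1.0 if leading <= max_leading_think else 0.0

    # Equal weighting is an arbitrary choice for this sketch.
    return 0.5 * ratio_term + 0.5 * timing_term


balanced = [("speak", 10), ("think", 10), ("speak", 10)]
front_loaded = [("think", 20), ("speak", 10)]
print(ta_balance_reward(balanced, target_ratio=0.5))
print(ta_balance_reward(front_loaded, target_ratio=0.5))
```

Under these assumptions, an output that starts speaking immediately and keeps reasoning at the target proportion scores highest, while one that reasons silently before speaking is penalized on both terms.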
