STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models
Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

TL;DR
STITCH enables spoken language models to perform internal reasoning while speaking, reducing latency and improving performance on reasoning tasks by alternating between generating unspoken reasoning and spoken responses.
Contribution
The paper introduces STITCH, a novel method for simultaneous internal reasoning and speaking in spoken language models, reducing latency and enhancing reasoning capabilities.
Findings
Outperforms baselines without unspoken reasoning by 15% on math reasoning datasets
Matches latency of models without unspoken reasoning
Performs equally well on non-reasoning datasets
Abstract
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate…
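The alternation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Chunk`, `stitch_schedule`, `full_cot_first_schedule`, and `tokens_before_first_speech`, as well as the chunk counts and token sizes, are hypothetical and chosen only for illustration. The sketch compares how many unspoken reasoning tokens must be generated before the first spoken chunk when the full CoT comes first versus when reasoning and speech chunks alternate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    kind: str   # "reasoning" (unspoken) or "speech" (spoken)
    index: int  # position of the chunk within its own stream


def stitch_schedule(n_reasoning: int, n_speech: int,
                    reasoning_first: bool = True) -> List[Chunk]:
    """Alternate reasoning and speech chunks until both streams are used up.

    Hypothetical helper: it only models the ordering of chunks, not the model.
    """
    order = ("reasoning", "speech") if reasoning_first else ("speech", "reasoning")
    remaining = {"reasoning": n_reasoning, "speech": n_speech}
    emitted = {"reasoning": 0, "speech": 0}
    schedule: List[Chunk] = []
    while remaining["reasoning"] or remaining["speech"]:
        for kind in order:
            if remaining[kind]:
                schedule.append(Chunk(kind, emitted[kind]))
                emitted[kind] += 1
                remaining[kind] -= 1
    return schedule


def full_cot_first_schedule(n_reasoning: int, n_speech: int) -> List[Chunk]:
    """Baseline ordering: the complete chain of thought, then all speech."""
    reasoning = [Chunk("reasoning", i) for i in range(n_reasoning)]
    speech = [Chunk("speech", i) for i in range(n_speech)]
    return reasoning + speech


def tokens_before_first_speech(schedule: List[Chunk],
                               reasoning_chunk_tokens: int) -> int:
    """Count reasoning tokens generated before the first spoken chunk."""
    count = 0
    for chunk in schedule:
        if chunk.kind == "speech":
            break
        count += reasoning_chunk_tokens
    return count


if __name__ == "__main__":
    # Assumed sizes: 3 reasoning chunks of 100 tokens each, 2 speech chunks.
    interleaved = stitch_schedule(3, 2)
    baseline = full_cot_first_schedule(3, 2)
    print([c.kind for c in interleaved])
    print(tokens_before_first_speech(interleaved, 100))
    print(tokens_before_first_speech(baseline, 100))
```

Under these assumed sizes, the alternating order needs one reasoning chunk before speech can start, while the full-CoT-first order needs all of them, and that gap grows with the length of the CoT.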
Peer Reviews
Decision: ICLR 2026 Poster
- The idea is novel: enabling SLMs to think internally while speaking. The chunked reasoning design (STITCH-R and STITCH-S) is creative and practical for real-time systems.
- Overall, the paper is very well written.
- The experiments mainly focus on math reasoning datasets like GSM8K and SVAMP. It would be valuable to test STITCH on more diverse reasoning domains such as commonsense, dialogue reasoning, or multi-hop factual reasoning.
- The authors introduce three ways of integrating reasoning into spoken language models: Thinking Before Speech (TBS), Simultaneous Thinking and Talking with Reasoning First (STITCH-R), and Simultaneous Thinking and Talking with Speaking First (STITCH-S). The methodology is clearly described, and Figure 2 effectively visualizes the differences between these methods.
- The analysis is good. The authors report performance while varying the length of reasoning chunks during inference and analyze th
- Based on the performance tables (Table 1-(a) and 1-(b)), there is no clear winner that consistently outperforms all other baselines. On math datasets, STITCH-R and STITCH-S show mixed performance across models, sometimes performing significantly worse than TBS (e.g., TBS 64.94, STITCH-R 58.70, STITCH-S 56.72). Similarly, on non-reasoning datasets, STITCH-R and STITCH-S perform inconsistently relative to other baselines. The paper does not clearly explain the reasons behind these trends.
- Foll
**[S1]** The paper is written clearly and is easy to follow. The operational mechanism of the proposed method is straightforward to understand.
**[S2]** Notably, the model achieves results on mathematical tasks comparable to those of approaches that explicitly perform reasoning, with reduced latency.
While the study presents an interesting direction, the scope of its **novelty and generalizability appears somewhat limited**. The following points are offered as considerations rather than criticisms: **[W1]** The reported optimization is based on the **A100 + vLLM** setting, which may limit the applicability of the results. It remains uncertain whether the proposed approach would generalize well to more limited hardware, larger models, or alternative architectures, such as those that jointly opti
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning
