STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Cheng-Han Chiang; Xiaofei Wang; Linjie Li; Chung-Ching Lin; Kevin Lin; Shujie Liu; Zhendong Wang; Zhengyuan Yang; Hung-yi Lee; Lijuan Wang

arXiv:2507.15375·cs.CL·February 10, 2026

Open Access 3 Reviews

TL;DR

STITCH enables spoken language models to perform internal reasoning while speaking, reducing latency and improving performance on reasoning tasks by alternating between generating unspoken reasoning and spoken responses.

Contribution

The paper introduces Stitch, a novel method for simultaneous internal reasoning and speaking in spoken language models, reducing latency and enhancing reasoning capabilities.

Findings

01

Outperforms baselines without unspoken reasoning by 15% on math reasoning datasets

02

Matches latency of models without unspoken reasoning

03

Performs as well as those baselines on non-reasoning datasets

Abstract

Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate…
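The alternation described in the abstract can be sketched as a simple scheduling problem. The following is a minimal illustration, not the authors' implementation: the function names, the chunk sizes (100 reasoning tokens, 40 speech tokens), and the generation speed and audio duration in `fits_in_playback` are all hypothetical values chosen for the example. It contrasts how many tokens must be generated before the first spoken chunk when the full CoT comes first versus when chunks are interleaved, and shows one way a "generation hides behind playback" condition might be expressed.

```python
from itertools import zip_longest


def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def interleave(reasoning, speech, n_reason, n_speech):
    """Alternate unspoken reasoning chunks and spoken chunks, reasoning first.

    Chunk sizes are illustrative assumptions, not values taken from the paper.
    """
    schedule = []
    pairs = zip_longest(chunk(reasoning, n_reason), chunk(speech, n_speech))
    for r, s in pairs:
        if r:
            schedule.append(("reason", r))
        if s:
            schedule.append(("speak", s))
    return schedule


def full_cot_first(reasoning, speech, n_speech):
    """Naive ordering: emit the whole chain-of-thought, then start speaking."""
    return [("reason", reasoning)] + [("speak", s) for s in chunk(speech, n_speech)]


def tokens_before_first_speech(schedule):
    """Count tokens generated before the first spoken chunk begins."""
    count = 0
    for kind, toks in schedule:
        if kind == "speak":
            return count
        count += len(toks)
    return count


def fits_in_playback(n_reason, n_speech, tokens_per_sec, audio_sec_per_chunk):
    """Hypothetical check: can the next reasoning + speech chunk be generated
    while the current spoken chunk is still playing?"""
    return (n_reason + n_speech) / tokens_per_sec <= audio_sec_per_chunk


reasoning = [f"r{i}" for i in range(250)]  # stand-in reasoning tokens
speech = [f"s{i}" for i in range(100)]     # stand-in speech tokens

stitched = interleave(reasoning, speech, n_reason=100, n_speech=40)
naive = full_cot_first(reasoning, speech, n_speech=40)

print(tokens_before_first_speech(stitched))  # tokens before speech, interleaved
print(tokens_before_first_speech(naive))     # tokens before speech, full CoT first
```

In this sketch the interleaved schedule delays the first spoken chunk by one reasoning chunk only, whereas the naive ordering delays it by the entire reasoning length, which is the latency cost the abstract attributes to generating a complete CoT before talking.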

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The idea is novel: enabling SLMs to think internally while speaking. The chunked reasoning design (STITCH-R and STITCH-S) is creative and practical for real-time systems.
- Overall, the paper is very well written.

Weaknesses

- The experiments mainly focus on math reasoning datasets like GSM8K and SVAMP. It would be valuable to test STITCH on more diverse reasoning domains such as commonsense, dialogue reasoning, or multi-hop factual reasoning.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The authors introduce three ways of integrating reasoning into spoken language models: Thinking Before Speech (TBS), Simultaneous Thinking and Talking with Reasoning First (STITCH-R), and Simultaneous Thinking and Talking with Speaking First (STITCH-S). The methodology is clearly described, and Figure 2 effectively visualizes the differences between these methods.
- The analysis is good. The authors report performance while varying the length of reasoning chunks during inference and analyze th

Weaknesses

- Based on the performance tables (Table 1-(a) and 1-(b)), there is no clear winner that consistently outperforms all other baselines. On math datasets, STITCH-R and STITCH-S show mixed performance across models, sometimes performing significantly worse than TBS (e.g., TBS 64.94, STITCH-R 58.70, STITCH-S 56.72). Similarly, on non-reasoning datasets, STITCH-R and STITCH-S perform inconsistently relative to other baselines. The paper does not clearly explain the reasons behind these trends.
- Foll

Reviewer 03 · Rating 4 · Confidence 3

Strengths

**[S1]** The paper is written clearly and is easy to follow. The operational mechanism of the proposed method is straightforward to understand.

**[S2]** Notably, with reduced latency, the model achieves results on mathematical tasks comparable to approaches that explicitly perform reasoning.

Weaknesses

While the study presents an interesting direction, the scope of its **novelty and generalizability appears somewhat limited**. The following points are offered as considerations rather than criticisms:

**[W1]** The reported optimization is based on the **A100 + vLLM** setting, which may limit the applicability of the results. It remains uncertain whether the proposed approach would generalize well to limited hardware, larger models, or alternative architectures, such as those that jointly opti

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning