StreamingThinker: Large Language Models Can Think While Reading

Junlong Tong; Yingqi Fan; Anhao Zhao; Yunpu Ma; Xiaoyu Shen

arXiv:2510.17238·cs.CL·March 20, 2026

Open Access · 3 Reviews

TL;DR

StreamingThinker introduces a streaming reasoning paradigm for large language models, enabling them to think while reading, which reduces latency and maintains reasoning performance in dynamic scenarios.

Contribution

It is the first framework to implement streaming reasoning in LLMs, combining streaming CoT generation, order-preserving mechanisms, and parallel inference for real-time reasoning.

Findings

01. 80% reduction in token waiting time before reasoning

02. Over 60% reduction in overall latency for final answers

03. Maintains reasoning performance comparable to batch processing
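
To show what a "reduction in waiting time before reasoning" measures, here is a toy timing model in Python. It is a sketch under stated assumptions, not the paper's measurement protocol: it assumes input sentences arrive at known times, that batch reasoning can start only after the last sentence arrives, and that streaming reasoning can start after the first. The function names and arrival times are hypothetical, and the printed figure is unrelated to the paper's reported numbers.

```python
def wait_before_reasoning(arrival_times, streaming):
    """Seconds until reasoning can start, given sentence arrival times."""
    if not arrival_times:
        raise ValueError("need at least one arrival time")
    # Assumption: streaming starts after the first sentence arrives,
    # while batch reasoning waits for the last one.
    return min(arrival_times) if streaming else max(arrival_times)


def relative_reduction(baseline, improved):
    """Fraction of the baseline that was saved."""
    return (baseline - improved) / baseline


# Hypothetical arrival times (seconds) for four input sentences.
arrivals = [0.5, 1.0, 1.5, 2.0]
batch_wait = wait_before_reasoning(arrivals, streaming=False)
stream_wait = wait_before_reasoning(arrivals, streaming=True)
reduction = relative_reduction(batch_wait, stream_wait)
print(f"batch={batch_wait}s streaming={stream_wait}s reduction={reduction:.0%}")
```

In this model the saving grows with the number of input sentences, since the batch wait tracks the last arrival and the streaming wait tracks the first.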

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces…
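
The idea of reasoning that "unfolds in the order of input" can be illustrated with a minimal Python sketch. This is not the StreamingThinker implementation: the sentence splitter is naive, `toy_reason_step` is a hypothetical stand-in for a model call, and the example text is invented. The sketch only shows the control flow, in which each reasoning note is produced as a sentence arrives and can see only the sentences read so far. It omits the final depth adjustment, the training constraints, and the parallel inference described in the abstract.

```python
import re
from typing import Callable, Iterable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption: '.', '!' or '?' ends a sentence)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def think_while_reading(
    sentences: Iterable[str],
    reason_step: Callable[[List[str], str], str],
) -> Iterator[str]:
    """Yield one reasoning note per incoming sentence, in input order."""
    seen: List[str] = []
    for sentence in sentences:
        # The step sees only what has been read so far plus the new sentence.
        note = reason_step(seen, sentence)
        seen.append(sentence)
        yield note


def toy_reason_step(seen: List[str], sentence: str) -> str:
    """Hypothetical placeholder for an LLM call."""
    return f"[note {len(seen) + 1}] read: {sentence}"


# Invented example input, not taken from the paper.
text = "Ann has 3 apples. She buys 2 more. How many apples does she have?"
notes = list(think_while_reading(split_sentences(text), toy_reason_step))
for note in notes:
    print(note)
```

Because `think_while_reading` is a generator, a caller can consume each note as soon as its sentence has been read, rather than waiting for the full input.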

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper introduces an interesting idea of streaming thinking, where LLMs reason concurrently with incoming input rather than after receiving the full context. This paradigm is conceptually appealing and well-motivated by the analogy to human cognition, offering a fresh perspective on reducing reasoning latency while maintaining coherence.
2. The proposed framework is technically complete, integrating data construction, streaming-aware training, and parallel inference into a coherent system.

Weaknesses

1. The authors should discuss in more depth why the method works. Thinking while reading is interesting, but the question is clearly incomplete while it is still being read. Several papers [1,2,3] report that incomplete questions negatively affect model performance. So why would thinking based on incomplete questions during reading improve performance?
[1] Laban, Philippe, et al. "LLMs get lost in multi-turn conversation." arXiv preprint arXiv:2505.06120 (

Reviewer 02 · Rating 8 · Confidence 3

Strengths

* The paper is well written and the figures are very clear.
* The method and paradigm are novel, since I am not aware of any other work that performs reasoning concurrently with input prefill. The idea is also well motivated and could be highly impactful.
* Evaluation is performed over a diverse set of datasets and shows a substantial reduction in the number of reasoning tokens as well as in latency, with little reduction in accuracy.
* The distinction and evaluation of the question-first and context-first

Weaknesses

* Tables lack variances.
* The parallelization still occurs at sentence level chunks, and finer/larger granularities were not investigated.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Introduce paradigm that allows model to stream while reasoning
- Intuitive, well-engineered and clearly described training and evaluation frameworks
- Demonstrate that using streaming performs on par with batch reasoning but with less latency
- Use of multiple domains to demonstrate that this paradigm does well on different types of problems

Weaknesses

- My primary reservation is the motivation behind the need for streaming thinking. What kind of application requires streaming reasoning? In what scenarios is batch thinking insufficient, especially when the batches are small? In the example in Figure 1, it makes much more intuitive sense to perform batch thinking to avoid "overly eager", not-so-helpful thinking tokens like "Okay, now we get the background of Charlie's schedule." Intuitively, I would argue that math questions evaluated in this pa

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks