StreamingThinker: Large Language Models Can Think While Reading
Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

TL;DR
StreamingThinker introduces a streaming reasoning paradigm for large language models, enabling them to think while reading, which reduces latency and maintains reasoning performance in dynamic scenarios.
Contribution
It is the first framework to implement streaming reasoning in LLMs, combining streaming CoT generation, order-preserving mechanisms, and parallel inference for real-time reasoning.
Findings
80% reduction in token waiting time before reasoning
Over 60% reduction in overall latency for final answers
Maintains reasoning performance comparable to batch processing
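As a rough illustration of the first finding, the sketch below counts how many input tokens must arrive before reasoning can begin under batch versus streaming processing. It assumes reasoning is triggered at sentence boundaries (the reviews mention sentence-level chunks). The function names and sentence lengths are hypothetical and not taken from the paper, so the resulting ratio is a toy value rather than a reproduction of the reported numbers.

```python
from typing import Sequence


def waiting_tokens_batch(sentence_lengths: Sequence[int]) -> int:
    """Batch reasoning: the whole input must arrive before thinking starts."""
    return sum(sentence_lengths)


def waiting_tokens_streaming(sentence_lengths: Sequence[int]) -> int:
    """Streaming reasoning (assumed sentence-level trigger):
    thinking can start once the first sentence has arrived."""
    return sentence_lengths[0] if sentence_lengths else 0


def waiting_reduction(sentence_lengths: Sequence[int]) -> float:
    """Fraction of pre-reasoning waiting tokens saved by streaming."""
    batch = waiting_tokens_batch(sentence_lengths)
    if batch == 0:
        return 0.0
    return 1 - waiting_tokens_streaming(sentence_lengths) / batch


if __name__ == "__main__":
    # Hypothetical token counts for a three-sentence input.
    lengths = [10, 15, 15]
    print(waiting_tokens_batch(lengths))      # tokens awaited in batch mode
    print(waiting_tokens_streaming(lengths))  # tokens awaited in streaming mode
    print(waiting_reduction(lengths))         # relative saving
```

Under this toy model the saving grows with the number of sentences that follow the first one, which is why longer, multi-sentence contexts would benefit most from starting to think early.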
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces…
Peer Reviews
Decision · ICLR 2026 Poster
1. The paper introduces an interesting idea of streaming thinking, where LLMs reason concurrently with incoming input rather than after receiving the full context. This paradigm is conceptually appealing and well-motivated by the analogy to human cognition, offering a fresh perspective on reducing reasoning latency while maintaining coherence.
2. The proposed framework is technically complete, integrating data construction, streaming-aware training, and parallel inference into a coherent system.
1. The authors should discuss in more depth why the method works. Thinking while reading is interesting, but the question is clearly incomplete while the model is still reading. Several papers [1,2,3] report that incomplete questions hurt model performance. So why would thinking based on incomplete questions during reading improve performance? [1] Laban, Philippe, et al. "LLMs get lost in multi-turn conversation." arXiv preprint arXiv:2505.06120 (
* The paper is well written and the figures are very clear.
* The method and paradigm are novel: I am not aware of any other work that performs reasoning concurrently with input prefill. The idea is also well motivated and could be highly impactful.
* Evaluation is performed over a diverse set of datasets and shows a substantial reduction in the number of reasoning tokens as well as in latency, with little reduction in accuracy.
* The distinction and evaluation of the question-first and context-first
* Tables lack variances.
* The parallelization still occurs at sentence-level chunks, and finer/larger granularities were not investigated.
- Introduces a paradigm that lets the model reason while the input streams in
- Intuitive, well-engineered, and clearly described training and evaluation frameworks
- Demonstrates that streaming reasoning performs on par with batch reasoning but with less latency
- Uses multiple domains to demonstrate that this paradigm does well on different types of problems
- My primary reservation is the motivation behind the need for streaming thinking. What kind of application requires streaming reasoning? In what scenarios is batch thinking insufficient, especially when the batches are small? In the example in Figure 1, it makes much more intuitive sense to perform batch thinking to avoid "overly eager", not-so-helpful thinking tokens like "Okay, now we get the background of Charlie's schedule." Intuitively, I would argue that math questions evaluated in this pa
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks
