Beyond Model Scaling: Test-Time Intervention for Efficient Deep Reasoning

Qianyue Wang; Jinwu Hu; Yufeng Wang; Huanxiang Lin; Bolin Chen; Zhiquan Wen; Yaofo Chen; Mingkui Tan

arXiv:2601.11252·cs.AI·January 19, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Think-with-Me, a test-time interactive reasoning paradigm that incorporates external feedback to improve the efficiency and accuracy of large reasoning models during multi-step reasoning tasks.

Contribution

It proposes a novel test-time intervention method using external feedback at transitional points to enhance reasoning efficiency and accuracy in large models.

Findings

01

Outperforms existing methods on AIME24 with 7.19% higher accuracy.

02

Reduces reasoning length by 81% under limited context windows.

03

Effective in security and creative reasoning tasks.

Abstract

Large Reasoning Models (LRMs) excel at multi-step reasoning but often suffer from inefficient reasoning processes like overthinking and overshoot, where excessive or misdirected reasoning increases computational cost and degrades performance. Existing efficient reasoning methods operate in a closed-loop manner, lacking mechanisms for external intervention to guide the reasoning process. To address this, we propose Think-with-Me, a novel test-time interactive reasoning paradigm that introduces external feedback intervention into the reasoning process. Our key insights are twofold: (1) transitional conjunctions serve as natural points for intervention, signaling phases of self-validation or exploration; and (2) using transitional words appropriately to prolong the reasoning enhances performance, while excessive use degrades it. Building on these insights, Think-with-Me pauses reasoning at…
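
The abstract's idea of treating transitional words as intervention points can be illustrated with a minimal sketch. This is not the authors' implementation: the word list `TRANSITION_WORDS`, the function names, and the `should_stop` callback are all hypothetical, and the code replays a finished text trace instead of pausing a live model. It only shows how one might locate candidate pause points and let an external judge decide whether to cut reasoning short.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical word list; the source does not enumerate the transitional words it uses.
TRANSITION_WORDS = ("wait", "but", "however", "alternatively", "so", "therefore")

_PATTERN = re.compile(r"\b(" + "|".join(TRANSITION_WORDS) + r")\b", re.IGNORECASE)


def find_intervention_points(trace: str) -> List[Tuple[int, str]]:
    """Return (character offset, lowercased word) for each transitional word found."""
    return [(m.start(), m.group(1).lower()) for m in _PATTERN.finditer(trace)]


def truncate_with_feedback(trace: str, should_stop: Callable[[str, str], bool]) -> str:
    """Walk the candidate pause points in order and ask the external judge
    (a stand-in for human or LLM feedback) whether to stop at each one.

    Returns the text before the first point where the judge says stop,
    or the full trace if it never does.
    """
    for offset, word in find_intervention_points(trace):
        if should_stop(trace[:offset], word):
            return trace[:offset].rstrip()
    return trace


if __name__ == "__main__":
    demo = "x = 2. Wait, let me check. So x = 2. But maybe x = 3?"
    # Toy judge (assumption): stop as soon as the model starts a "but" branch.
    print(truncate_with_feedback(demo, lambda prefix, word: word == "but"))
```

In a real system the judge would need access to the problem and the partial reasoning, and the model would resume generation after feedback; both are outside the scope of this sketch.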

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The study fully adheres to a rigorous logical chain of scientific exploration. The team first systematically analyzed the model’s reasoning behavior, identified that "transitional words" can serve as intervention nodes, and validated their effectiveness. This rigorous preliminary exploration laid a theoretical foundation for the design of Think-with-Me, forming a scientific closed loop. 2. The experimental validation is comprehensive, covering a wide range of tasks and comparing with various ma

Weaknesses

1. Although the authors provide additional experimental details in the appendix, they do not release the source code, resulting in low reproducibility. 2. This approach relies on an external feedback mechanism, which may introduce new risks. If the LLM proxy generates incorrect feedback and the target model lacks a built-in correction mechanism, performance degradation or error propagation could occur; the authors offer no further discussion on this issue.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is well-written, clear, and easy to follow. 2. The proposed method achieves a significant improvement in compressing reasoning length. 3. It provides both LLM-based and human-in-the-loop feedback mechanisms, enhancing the scalability of the approach.

Weaknesses

1. Several existing works, for example [1], have already explored, from various perspectives, the use of external feedback frameworks (such as human feedback, model feedback, or verifiers) to improve LLM training, and these approaches can be adapted to address the long2short problem. Therefore, the empirical motivation of this paper needs to be further strengthened. 2. The paper's two key observations are not new: the first observation has been similarly articulated in numerous works since [2], and

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* This paper is well-written and easy to understand. * This paper addresses an important issue in optimizing the test-time reasoning efficiency.

Weaknesses

1. My main criticism is **on the design of the experiments**: a. In the observation experiments of Section 3.1 in Figure 1, the authors conduct their analysis on DeepSeek-R1-Distill-Qwen-32B, showing in Figures 1(a) and 1(b) the prevalence of conjunction tokens in the reasoning traces of **o1-like models**. However, when investigating the influence of these tokens, they switch their experimental model to Qwen2.5-72B-Instruct, which, as far as I know, does **not possess long reasoning ability*

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics