Step-Tagging: Toward controlling the generation of Language Reasoning Models through step monitoring
Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher

TL;DR
This paper introduces Step-Tagging, a framework for real-time monitoring and controlling Language Reasoning Models by annotating reasoning steps, leading to more efficient inference with reduced token usage while maintaining accuracy.
Contribution
The paper presents a novel step-tagging framework and ReasonType taxonomy for real-time reasoning step annotation, enabling early stopping and improved control over LRMs.
Findings
Achieved 20-50% token reduction while maintaining accuracy.
Effective early stopping criteria based on reasoning step counts.
Demonstrated framework on multiple reasoning models and datasets.
Abstract
The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating. To monitor reasoning behaviors, we introduce ReasonType: a novel taxonomy of reasoning steps. Building on this framework, we demonstrate that online monitoring of the count of specific steps can produce effective, interpretable early stopping criteria for LRM inference. We evaluate the Step-Tagging framework on three open-source reasoning models across standard benchmark datasets:…
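The count-based early stopping idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_tagger` is a hypothetical keyword-based stand-in for the learned sentence classifier, the labels `verification`, `reflection`, and `other` are illustrative rather than the actual ReasonType categories, and the threshold of 2 is an arbitrary assumption.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def toy_tagger(sentence: str) -> str:
    """Hypothetical stand-in for the learned step classifier (keyword rules only)."""
    lowered = sentence.lower()
    if "check" in lowered or "verify" in lowered:
        return "verification"
    if "wait" in lowered or "alternatively" in lowered:
        return "reflection"
    return "other"


def monitor_with_early_stop(
    sentences: Iterable[str],
    tagger: Callable[[str], str],
    monitored_tag: str,
    max_count: int,
) -> Tuple[List[str], bool]:
    """Tag each step as it arrives; stop once `monitored_tag` has been seen `max_count` times.

    Returns the steps kept so far and a flag saying whether generation was cut short.
    """
    counts: Counter = Counter()
    kept: List[str] = []
    for sentence in sentences:
        kept.append(sentence)
        counts[tagger(sentence)] += 1
        if counts[monitored_tag] >= max_count:
            return kept, True
    return kept, False


# A made-up reasoning trace, standing in for sentences streamed from a model.
trace = [
    "Let x be the unknown number.",
    "Then 2x + 3 = 11, so x = 4.",
    "Let me verify: 2 * 4 + 3 = 11.",
    "Wait, maybe I misread the question.",
    "Let me check again: 2 * 4 + 3 = 11.",
    "Alternatively, I could solve it graphically.",
]

kept, stopped = monitor_with_early_stop(trace, toy_tagger, "verification", 2)
print(len(kept), stopped)
```

In this toy trace the monitor halts at the second step tagged `verification`, so the final sentence is never consumed. In a real deployment the iterable would be the model's streamed output split into sentences, and the threshold would have to be calibrated per model and task.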
Peer Reviews
Decision·Submitted to ICLR 2026
1. **High Innovativeness and Practicality:** The paper directly confronts the core pain point of low efficiency in current LLMs for complex reasoning tasks. The proposed Step-Tagging framework and ReasonType taxonomy offer a novel and practical perspective for understanding and controlling the model's "thought process." Compared to methods that rely on "black-box" approaches or complex prompt engineering, this framework is more interpretable and generalizable.
2. **Clear Methodology and Compl
1. **Taxonomy Subjectivity:** The "ReasonType" taxonomy was created with GPT-4o-mini, which introduces potential subjectivity and dependency on a specific model's capabilities.
2. **Application Complexity:** The framework requires a calibration step ("Pareto-curve") to find the optimal stopping strategy for each model and task, which raises the barrier to adoption.
3. **Offline vs. Online Gap:** Experiments were simulated offline. The potential latency from a real-time, on-the-fly implementat
- A new taxonomy of reasoning steps with 13 categories is proposed, providing a tool for fine-grained understanding and monitoring of the LRM reasoning process.
- A lightweight sentence classifier module is designed, capable of identifying the type of steps being generated by LRMs in real-time, enabling online monitoring of the reasoning process.
- An interpretable early stopping mechanism is validated based on the frequency of specific step types, demonstrating significant token reduction whi
- An evaluation of the latency introduced by the Step-Tagger module in inference scenarios must be included in the paper. It needs to be demonstrated that the inference time of the classifier itself is significantly less than the time saved by token reduction.
- Although Appendix G argues for the choice's reasonableness, it remains a critical hyperparameter that needs manual calibration for each new model, increasing the method's application complexity.
- The P_guided baseline (especially the
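The latency concern raised in this review comes down to simple arithmetic: the tagger pays off only if the decoding time saved by skipped tokens exceeds the time spent classifying steps. A rough sketch of that break-even check follows. The function name and every number in it are hypothetical placeholders, not measurements from the paper.

```python
def net_time_saved(
    tokens_saved: int,
    sec_per_token: float,
    steps_tagged: int,
    sec_per_tag: float,
) -> float:
    """Decoding time avoided minus classifier time spent, in seconds.

    A positive value means the monitor pays for itself; a negative value
    means tagging costs more than the early stop recovers.
    """
    return tokens_saved * sec_per_token - steps_tagged * sec_per_tag


# Hypothetical figures: 1000 tokens skipped at 20 ms each, 80 steps tagged at 5 ms each.
saving = net_time_saved(tokens_saved=1000, sec_per_token=0.02,
                        steps_tagged=80, sec_per_tag=0.005)
print(f"net seconds saved: {saving:.2f}")
```

This ignores batching, scheduling overhead, and any stall the classifier introduces in the decoding loop, which is why a measured end-to-end evaluation would still be needed.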
1. The paper presents a taxonomy of reasoning steps in LLMs.
2. The authors correctly leverage their taxonomy to obtain token efficiency without damaging results.
3. The paper clearly presents their idea and method.
1. Tags are derived from GPT-4o-mini. The authors do not mention or run an ablation study on this training dataset.
2. Ablation on labels. The authors do not show the quality of their tags. They could extract a subset of their dataset and compare the tags against other models or human annotators.
3. The BERT router's Micro-F1 ≈ 0.78 suggests routing errors may affect benefits. It is unclear how router errors propagate to overall accuracy/efficiency.
4. Figures cannot be correctly visualized at the cu
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
