ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context
Joongwon Kim, Anirudh Goyal, Liang Tan, Hannaneh Hajishirzi, Srinivasan Iyer, Tianlu Wang

TL;DR
ASTRO is a novel training framework that enhances language models' reasoning by teaching them structured search, reflection, and backtracking, leading to significant performance improvements on mathematical problem-solving benchmarks.
Contribution
Introduces ASTRO, a search-inspired training method that enables non-reasoner language models to internalize structured reasoning behaviors through synthetic search traces and reinforcement learning.
Findings
Achieved up to 26.9% performance gains on math benchmarks.
Improved reasoning on problems requiring iterative correction.
Demonstrated effectiveness on Llama 3 models.
Abstract
We introduce ASTRO, the "Autoregressive Search-Taught Reasoner", a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. ASTRO teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By…
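The idea of turning a search tree into one autoregressive trace that contains a visible backtrack can be illustrated with a toy example. The sketch below is not the authors' implementation. The `Node` class, the `linearize` function, the wording of the reflection line, and the example problem are all hypothetical, and the tree is built by hand rather than produced by MCTS. It only shows how one incorrect path and one correct path that share a prefix might be stitched into a single sequence of steps.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One reasoning step in a hypothetical search tree (illustrative only)."""
    step: str
    children: List["Node"] = field(default_factory=list)
    is_correct: Optional[bool] = None  # set only on terminal nodes


def paths_to_leaves(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of nodes."""
    path = prefix + (node,)
    if not node.children:
        yield path
        return
    for child in node.children:
        yield from paths_to_leaves(child, path)


def linearize(root):
    """Stitch one incorrect path and one correct path into a single trace.

    The trace walks the incorrect path first, emits a reflection line that
    names the last shared step, then continues along the correct path.
    """
    paths = list(paths_to_leaves(root))
    wrong = next((p for p in paths if p[-1].is_correct is False), None)
    right = next((p for p in paths if p[-1].is_correct is True), None)
    if right is None:
        raise ValueError("tree has no correct terminal node")
    if wrong is None:
        return [n.step for n in right]  # direct solution, no backtracking

    # Count the steps shared by both paths (at least the root).
    shared = 0
    while shared < min(len(wrong), len(right)) and wrong[shared] is right[shared]:
        shared += 1

    trace = [n.step for n in wrong]
    # Hypothetical reflection phrase; the real phrasing is an assumption.
    trace.append("Wait, this looks wrong. Let me go back to: " + wrong[shared - 1].step)
    trace.extend(n.step for n in right[shared:])
    return trace


# Hand-built toy tree standing in for a search result.
tree = Node("Solve 2x + 6 = 10.", [
    Node("Subtract 6 from both sides: 2x = 4.", [
        Node("Divide by 2: x = 4.", is_correct=False),
        Node("Divide by 2: x = 2.", is_correct=True),
    ]),
])
trace = linearize(tree)
print("\n".join(trace))
```

In this toy version the incorrect branch is always visited first and only one backtrack is inserted. A real pipeline would presumably choose which branches to include and how to phrase the reflection in a more varied way.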
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The presentation is good and easy to understand. The figures elucidate the concepts, particularly the structure of the MCTS traces and the linearization process. The main novelty is the introduction of artificial backtracking behaviors. This differentiates the work from many prior works, which also collect MCTS traces but use only direct correct paths. The experiments demonstrate that models initialized with the search prior outperform those initialized only with direct solutions. The training and evaluat
We know that reasoning models extensively use backtracking, a key behavior that previous instruction-following models lack. It is good to know that artificial backtracking improves performance, but from this point of view, the novelty appears limited. It would be better to ablate this key novelty. Even though little is publicly known about DeepSeek-R1's cold-start data, or whether OpenAI o1 uses cold-start data at all, the authors could compare the effect of their synthetic data with
Using MCTS to bootstrap reasoning-trajectory data for SFT is a reasonable and likely promising idea, if combined properly with RL and SFT. In its current version, I'm not sure I correctly understand the benefit of using MCTS in this paper (see below).
- Empirical results are not strong: The main results in Table 1 show only marginal gains over the other baselines (SPOC, Step-KTO), and it is not clear how their complexity differs. My impression, based on the paper's description, is that ASTRO is much more complex than SPOC and Step-KTO.
- The main insight is not clear: The paper mentions many tricks for preparing SFT data and applying RL, but I don't see what the main insight is compared with prior work. As mentioned in
The central idea—procedure-cloning search traces into natural language and then optimizing with verifier-based RL—is original in how it couples explicit backtracking with a purely autoregressive policy. Unlike external scaffolds, the model is trained to “think like search” within a single pass, and the linearization scheme that stitches together incorrect and correct endpoints is a clever way to distill exploration signals into text. Methodologically, the paper is clear about the three stages,
Important related work is missing or under-emphasized. In particular, Satori (an RL-trained LRM with autoregressive search behaviors), RAP, and rStar—the earliest approaches that use MCTS with LLMs—should be positioned as direct antecedents, with a careful comparison of objectives, supervision signals, tree policies, and data generation. This gap makes it harder to judge conceptual novelty relative to prior MCTS-style pipelines and RL-trained “search-in-language” agents. The evaluation suite als
The paper is clearly written and easy to follow, effectively contextualizing the work within prior research such as procedure cloning [1] and stream of search [2].
[1] Yang et al., Chain of Thought Imitation with Procedure Cloning, NeurIPS 2022
[2] Gandhi et al., Stream of Search (SoS): Learning to Search in Language, COLM 2024
The paper's main objective is to investigate whether an LLM can perform long-form reasoning without relying on pre-existing reasoning traces, such as those from large CoT models like DeepSeek-R1. However, I have significant concerns about this. 1. Practicality: The research objective appears misaligned with the chosen domain of mathematical reasoning. Large-scale, high-quality SFT datasets with long reasoning traces are already widely available for math tasks. Consequently, the motivation for empl
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
