How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning

Hongyi James Cai; Junlin Wang; Xiaoyin Chen; Bhuwan Dhingra

arXiv:2505.24273·cs.AI·June 2, 2025

TL;DR

This paper systematically explores how supervised finetuning and reinforcement learning interact to improve reasoning in large language models, focusing on the role and optimal extent of backtracking in chain-of-thought processes.

Contribution

It provides empirical insights into the effects of backtracking and chain-of-thought length on RL training, revealing how task difficulty influences optimal training strategies.

Findings

1. Longer chain-of-thought with backtracking improves RL training stability.
2. More complex problems require more backtracking during supervised finetuning.
3. RL training emphasizes structural patterns over correctness of long reasoning sequences.

Abstract

Recent breakthroughs in large language models (LLMs) have effectively improved their reasoning abilities, particularly on mathematical and logical problems that have verifiable answers, through techniques such as supervised finetuning (SFT) and reinforcement learning (RL). Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, the precise benefits of backtracking, specifically, how significantly it contributes to reasoning improvements and the optimal extent of its use, remain poorly understood. In this work, we systematically investigate the dynamics between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings highlight that short CoT…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. A very good research question: how SFT and RL affect LLM reasoning.
2. Good experimental settings across SFT types, benchmarks, etc.
3. Proposes a practical method to boost reasoning.

Weaknesses

1. **Insufficient experiments on model choices: only Qwen-2.5-3B.** [1] has shown that Qwen models behave in model-specific ways where reasoning abilities are concerned, so I believe **experiments on other open-source model series and sizes (maybe ~7B)** should be added.

[1] Spurious Rewards: Rethinking Training Signals in RLVR. https://arxiv.org/abs/2506.10947

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper carefully explores the effects of different SFT warm-up strategies (no-SFT, self-sampled, synthetic backtracking, shuffled) on RL training, providing clear empirical comparisons.
- The tasks, metrics, and model configurations (based on the Qwen2.5 family) are clearly described, and synthetic datasets are constructed in a principled way using DFS or heuristic search.
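The DFS-based construction of synthetic trajectories mentioned in this review can be illustrated with a minimal sketch. This is not the paper's code: the function names `dfs_countdown` and `make_trajectory`, the trace wording, and the Countdown rules used here (every number must be used once, integer arithmetic only) are assumptions made for illustration. The sketch only counts the backtracks a plain DFS happens to make; it does not control the count the way the paper's datasets are described as doing.

```python
from itertools import combinations


def dfs_countdown(nums, target, trace):
    """Depth-first search that logs every attempt and every dead end in `trace`."""
    if len(nums) == 1:
        if nums[0] == target:
            trace.append(f"reached {target}")
            return True
        # A leaf that misses the target is what we call a backtrack here.
        trace.append(f"dead end at {nums[0]}, backtrack")
        return False
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        # Candidate operations on the chosen pair (integer arithmetic only).
        candidates = [(f"{a}+{b}", a + b), (f"{a}*{b}", a * b),
                      (f"{a}-{b}", a - b), (f"{b}-{a}", b - a)]
        if b != 0 and a % b == 0:
            candidates.append((f"{a}/{b}", a // b))
        if a != 0 and b % a == 0:
            candidates.append((f"{b}/{a}", b // a))
        for expr, value in candidates:
            trace.append(f"try {expr}={value}")
            if dfs_countdown(rest + [value], target, trace):
                return True
    return False


def make_trajectory(nums, target):
    """Return (solved, trace, number_of_backtracks) for one Countdown instance."""
    trace = []
    solved = dfs_countdown(list(nums), target, trace)
    backtracks = sum("backtrack" in line for line in trace)
    return solved, trace, backtracks


solved, trace, backtracks = make_trajectory([2, 3], 6)
print(solved, backtracks)
print("\n".join(trace))
```

For the instance `[2, 3]` with target `6`, the search first tries `2+3`, hits a dead end, and then succeeds with `2*3`, so the trace contains exactly one backtrack. Joining the trace lines gives a text trajectory of the kind that could serve as an SFT example under these assumptions.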

Weaknesses

- Task scope is narrow: The evaluated tasks are mostly puzzle-style logical reasoning (Countdown, Sudoku, etc.), which limits generalizability to other forms of reasoning like mathematical proofs, symbolic integration, or commonsense reasoning.
- Limited model diversity: Experiments rely almost entirely on the Qwen2.5 family; it’s unclear whether the findings hold for other architectures (e.g., Llama).
- The general idea of combining SFT warm-up with RL is well-trodden; the specific contributi

Reviewer 03 · Rating 2 · Confidence 4

Strengths

**Originality:** The paper provides a systematic empirical investigation of backtracking's role in RL post-training through controlled synthetic data construction. The use of depth-first search and heuristic methods to generate trajectories with precise backtracking counts (0, 1, 2, 3, 5, 10) offers a methodologically clean approach to studying this phenomenon. The finding that incorrect trajectories can benefit RL training (Section 5.1) adds an interesting dimension to understanding SFT-RL inte

Weaknesses

**W1: Insufficient Motivation for Focusing Solely on Backtracking (Lines 92-93, Section 1)**

The paper abruptly introduces backtracking as the primary research focus without adequately justifying why this specific behavior merits exclusive attention over other potentially important reasoning patterns. Modern reasoning models exhibit multiple cognitive behaviors beyond backtracking, including:
- Multiple verification strategies
- Alternative solution exploration

The manuscript lacks a princip

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Software Engineering Research · Artificial Intelligence in Law

Methods: Shrink and Fine-Tune