SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

Alexis Limozin; Eduard Durech; Torsten Hoefler; Imanol Schlag; Valentina Pyatkin

arXiv:2604.23747·cs.LG·April 28, 2026

TL;DR

This paper identifies and corrects two bugs in baseline implementations and shows that, once they are fixed, the standard SFT-then-RL pipeline outperforms recent mixed-policy methods on LLM reasoning tasks.

Contribution

The authors identify and fix two bugs behind faulty baselines, a CPU-offloaded optimizer bug in DeepSpeed and a loss aggregation bug in OpenRLHF, and show that the corrected SFT-then-RL pipeline outperforms recent mixed-policy methods.

Findings

01 The corrected SFT-then-RL baseline beats the mixed-policy methods it is compared against on math benchmarks, by margins of +3.8 and +22.2 points.

02 A truncated SFT-then-RL variant outperforms mixed-policy methods while using fewer FLOPs.

03 Faulty baselines led earlier work to underestimate the effectiveness of the standard SFT-then-RL pipeline.

Abstract

Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math…
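To make the two failure modes described in the abstract concrete, here is a minimal toy sketch in plain Python. It is not DeepSpeed or OpenRLHF code, and the exact mechanics of the real bugs may differ. It assumes that "dropping intermediate micro-batches" means only the last micro-batch's gradient reaches the optimizer step, and that "incorrectly weighting per-mini-batch losses" resembles averaging per-batch means instead of weighting by token count. All function names and numbers are hypothetical.

```python
# Toy illustration only; names and numbers are hypothetical.

def accumulated_grad(micro_batch_grads, drop_intermediate=False):
    """Average scalar gradients over one gradient-accumulation window.

    With drop_intermediate=True only the final micro-batch contributes.
    This is an assumed stand-in for silently dropping the others; keeping
    the 1/n scaling in the faulty path is also an assumption.
    """
    n = len(micro_batch_grads)
    kept = micro_batch_grads[-1:] if drop_intermediate else micro_batch_grads
    return sum(kept) / n


def token_weighted_loss(batches):
    """Mean loss per token over all mini-batches.

    batches: list of (sum_of_token_losses, num_tokens) pairs.
    """
    total_loss = sum(loss_sum for loss_sum, _ in batches)
    total_tokens = sum(num_tokens for _, num_tokens in batches)
    return total_loss / total_tokens


def mean_of_batch_means(batches):
    """Unweighted mean of per-mini-batch mean losses.

    Short mini-batches count as much as long ones, so each of their
    tokens is over-weighted relative to tokens in long mini-batches.
    """
    return sum(loss_sum / num_tokens for loss_sum, num_tokens in batches) / len(batches)


grads = [0.5, -1.0, 2.0, 0.5]          # one gradient per micro-batch
print(accumulated_grad(grads))          # 0.5
print(accumulated_grad(grads, True))    # 0.125

batches = [(10.0, 10), (180.0, 90)]     # (summed token loss, token count)
print(token_weighted_loss(batches))     # 1.9
print(mean_of_batch_means(batches))     # 1.5
```

In this toy setting the faulty accumulation step sees a quarter of the intended gradient signal, and the unweighted aggregation reports 1.5 where the per-token loss is 1.9. Both errors are silent in the sense that training still runs and produces a plausible-looking loss curve.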

Code & Models

Repositories

alek6kun/sft_then_rl (GitHub)