Selective Off-Policy Reference Tuning with Plan Guidance
Duc Anh Le, Tien-Phat Nguyen, Thien Huu Nguyen, Linh Ngo Van, and Trung Le

TL;DR
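The paper introduces SORT, a method that extends reinforcement learning with verifiable rewards by adding a repair update for prompts where every sampled rollout fails. It weights tokens by how much more predictable they become when the model is conditioned on a plan derived from the reference solution, which improves results on reasoning benchmarks, especially for weaker models.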
Contribution
SORT is a novel approach that uses plan-guided token weighting to turn failures into informative learning signals without altering rollout generation.
Findings
SORT outperforms GRPO and guidance baselines across three backbones and eight reasoning benchmarks.
Largest gains are observed on weaker models.
SORT effectively turns failures into structured learning signals.
Abstract
Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with the largest gains on weaker models.
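The two mechanisms in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The first function uses the standard GRPO group normalization to show why an all-wrong prompt yields no gradient signal. The second is a hypothetical weighting rule: the abstract only says that tokens which become more predictable under plan conditioning get higher weight, so the clipped log-probability gain, the mean-one normalization, the choice of reference-solution tokens as the scored sequence, and the names `plan_guided_weights` and `weighted_nll` are all assumptions made for illustration.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def plan_guided_weights(logp_with_plan, logp_without_plan):
    """Hypothetical rule (assumption, not from the paper):
    weight each token by its positive log-prob gain under plan
    conditioning, rescaled so the weights average to 1."""
    gains = [max(a - b, 0.0) for a, b in zip(logp_with_plan, logp_without_plan)]
    total = sum(gains)
    if total == 0.0:
        # Assumed fallback: no token benefits from the plan, so no update.
        return [0.0] * len(gains)
    return [g * len(gains) / total for g in gains]

def weighted_nll(logp_without_plan, weights):
    """Assumed repair loss: token-weighted negative log-likelihood
    of the scored tokens under the plan-free policy."""
    return -sum(w * lp for w, lp in zip(weights, logp_without_plan)) / len(weights)

# All four rollouts fail, so every advantage is zero and GRPO has no signal.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))

# Per-token log-probs of three tokens, with and without the plan in context.
with_plan = [-1.0, -0.5, -2.0]
without_plan = [-1.0, -1.5, -1.0]
weights = plan_guided_weights(with_plan, without_plan)
print(weights)
print(weighted_nll(without_plan, weights))
```

In this toy input only the second token becomes more likely once the plan is in context, so it receives all of the weight, while a uniform imitation loss would have treated the three tokens equally.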
