BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning
Xuechen Zhang, Zijian Huang, Yingcong Li, Chenshun Ni, Jiasi Chen, Samet Oymak

TL;DR
BREAD introduces a novel training method that combines expert guidance with reinforcement learning to improve small language models' reasoning, especially when high-quality data is scarce, outperforming traditional approaches.
Contribution
The paper proposes BREAD, a method that unifies supervised fine-tuning and reinforcement learning using branched rollouts and expert hints, overcoming limitations of existing SFT + RL strategies.
Findings
BREAD outperforms standard GRPO while requiring fewer than 40% of the ground-truth traces.
BREAD speeds up training by approximately 3 times.
BREAD enables solving problems previously unsolvable by SFT + RL.
Abstract
Small language models (SLMs) struggle to learn complex reasoning behaviors, especially when high-quality traces are scarce or difficult to learn from. The standard training approach combines a supervised fine-tuning (SFT) stage, often to distill capabilities of a larger model, followed by a reinforcement learning (RL) stage such as Group Relative Policy Optimization (GRPO). In this paper, we investigate the fundamental limitations of this SFT + RL paradigm and propose methods to overcome them. Under a suitable theoretical model, we demonstrate that the SFT + RL strategy can fail completely when (1) the expert's traces are too difficult for the small model to express, or (2) the small model's initialization has exponentially small likelihood of success. To address these, we introduce BREAD: a GRPO variant that unifies the SFT and RL stages via partial expert guidance and branched…
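The sketch below illustrates why a group of uniformly failed rollouts gives GRPO nothing to learn from, and how branching from a partial expert trace might restore a signal. It is a minimal sketch under stated assumptions, not the authors' implementation: the group-relative advantage (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation, while `rollouts_with_expert_prefix`, `toy_sample`, `toy_reward`, and the rule of revealing one more expert step per attempt are hypothetical names and choices made for illustration only.

```python
import statistics

def group_advantages(rewards):
    """GRPO-style group-relative advantages: (r - mean) / std within one group.

    Uses the population standard deviation; real implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every rollout got the same reward, so no rollout is preferred.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def rollouts_with_expert_prefix(sample_fn, reward_fn, expert_trace, group_size):
    """Hypothetical branching loop (an assumption, not the paper's algorithm).

    Starts from an empty hint and reveals one more expert step each time the
    whole group fails, stopping as soon as any rollout succeeds.
    sample_fn(prefix) returns a full trace that begins with prefix.
    reward_fn(trace) returns 1.0 on success and 0.0 otherwise.
    """
    for k in range(len(expert_trace) + 1):
        prefix = expert_trace[:k]
        traces = [sample_fn(prefix) for _ in range(group_size)]
        rewards = [reward_fn(t) for t in traces]
        if any(r > 0 for r in rewards):
            break
    return k, traces, rewards

# Toy setup: the "small model" only finishes correctly once it has been
# given at least two expert steps as a hint.
expert = ["s1", "s2", "s3", "s4"]

def toy_sample(prefix):
    if len(prefix) >= 2:
        return list(prefix) + expert[len(prefix):]
    return list(prefix) + ["wrong"]

def toy_reward(trace):
    return 1.0 if trace == expert else 0.0

hint_len, traces, rewards = rollouts_with_expert_prefix(
    toy_sample, toy_reward, expert, group_size=4
)
no_signal = group_advantages([0.0, 0.0, 0.0, 0.0])    # all rollouts failed
some_signal = group_advantages([1.0, 0.0, 0.0, 0.0])  # one rollout succeeded
```

In this toy setup the unaided group fails entirely, so its advantages are all zero and the policy update carries no information. Once a hint of two expert steps is supplied, successful traces appear and the group can again contribute a nonzero advantage whenever outcomes differ within it.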
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
