Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
Yao Shu, Chenxing Wei, Hongbin Lin, Shuang Qiu, Hui Xiong

TL;DR
This paper introduces a reference-sampled Boltzmann projection method for KL-regularized reinforcement learning, enabling efficient policy optimization and analysis of finite-sample effects, with empirical validation on Qwen.
Contribution
It identifies the reference-sampled weighted supervised fine-tuning (SFT) objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer, and provides a finite one-shot analysis and practical algorithms for policy projection.
Findings
The proposed Boltzmann-Targeted SFT induces the same policy as the fixed-reference KL-regularized RLVR optimizer.
Finite one-shot analysis separates key error sources and explains coverage limitations.
Experiments show improved projection accuracy and optimization efficiency.
Abstract
Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the…
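To make the target-matching claim concrete, here is a minimal tabular sketch in Python. It relies on the standard fact that the optimizer of a KL-regularized objective with fixed reference is proportional to pi_ref(y) * exp(r(y) / beta), so the density ratio of target to reference is proportional to exp(r(y) / beta). Using that ratio as the per-sample weight is our assumption: the abstract is truncated before it names the weight. The four-response toy distribution, the binary rewards, beta = 0.5, and the function names are all hypothetical and not taken from the paper.

```python
import numpy as np


def boltzmann_target(pi_ref, reward, beta):
    """Exponentially tilt pi_ref by reward: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    logits = np.log(pi_ref) + reward / beta
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def weighted_sft_policy(samples, reward, beta, num_responses):
    """Tabular maximizer of sum_i w_i * log pi(y_i) with assumed weights w_i = exp(r(y_i) / beta).

    With an unconstrained categorical policy, the maximizer is the normalized
    total weight landing on each response.
    """
    weights = np.exp(reward[samples] / beta)
    mass = np.bincount(samples, weights=weights, minlength=num_responses)
    return mass / mass.sum()


# Hypothetical toy problem: four candidate responses to a single prompt.
rng = np.random.default_rng(0)
pi_ref = np.array([0.5, 0.3, 0.15, 0.05])   # reference policy (assumed)
reward = np.array([0.0, 0.0, 1.0, 1.0])     # binary verifier reward (assumed)
beta = 0.5                                   # KL-regularization strength (assumed)

target = boltzmann_target(pi_ref, reward, beta)

# Rollouts come from the reference policy only; no online sampling is needed.
samples = rng.choice(len(pi_ref), size=200_000, p=pi_ref)
fitted = weighted_sft_policy(samples, reward, beta, len(pi_ref))

max_gap = float(np.abs(fitted - target).max())
print("target:", np.round(target, 4))
print("fitted:", np.round(fitted, 4))
print("max abs gap:", max_gap)
```

In this tabular setting the gap between the fitted and target policies comes only from the finite number of reference rollouts. Responses that the reference rarely samples contribute few, heavily weighted terms, which gives one simple view of why coverage matters in the one-shot regime.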
