Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

Yao Shu; Chenxing Wei; Hongbin Lin; Shuang Qiu; Hui Xiong

arXiv:2605.02469·cs.LG·May 5, 2026

TL;DR

This paper introduces a reference-sampled Boltzmann projection for KL-regularized reinforcement learning with verifiable rewards (RLVR): a weighted-SFT objective on precomputed reference rollouts whose induced policy matches the RLVR optimizer, together with an analysis of finite-sample effects and empirical validation on Qwen.

Contribution

It identifies the reference-sampled weighted supervised fine-tuning (SFT) objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer, and provides a finite one-shot analysis and practical algorithms for policy projection.

Findings

01

The policy induced by the proposed Boltzmann-Targeted SFT objective matches the fixed-reference KL-regularized RLVR optimizer.

02

Finite one-shot analysis separates key error sources and explains coverage limitations.

03

Experiments show improved projection accuracy and optimization efficiency.

Abstract

Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the…
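
The exponential tilting described in the abstract can be illustrated on a toy discrete example. The sketch below is not the paper's code. It assumes the standard closed form of the KL-regularized optimizer, in which the target is proportional to the reference probability times exp(reward / beta), and it treats a single prompt with three candidate responses. The function names `boltzmann_target` and `density_ratio_weights`, the temperature name `beta`, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def boltzmann_target(ref_probs, rewards, beta):
    """Exponentially tilt a discrete reference distribution by reward / beta.

    Assumption: this is the standard closed-form optimizer of a
    KL-regularized objective with a fixed reference policy.
    """
    unnorm = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(unnorm)  # per-prompt normalizer (partition function)
    return [u / z for u in unnorm]


def density_ratio_weights(ref_probs, rewards, beta):
    """Weights target / reference for responses sampled from the reference."""
    target = boltzmann_target(ref_probs, rewards, beta)
    return [t / p for t, p in zip(target, ref_probs)]


# Hypothetical toy prompt: three responses, binary verifier reward.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 1.0]
beta = 1.0

target = boltzmann_target(ref_probs, rewards, beta)
weights = density_ratio_weights(ref_probs, rewards, beta)

# Reweighting reference samples by these weights recovers the target:
# reference probability * weight equals the target probability.
reweighted = [p * w for p, w in zip(ref_probs, weights)]
```

In this toy setting the weights depend on the response only through exp(reward / beta), divided by a constant that is shared across all responses to the same prompt. A tabular policy fit by weighted maximum likelihood on reference samples with these weights would therefore converge to `target` rather than to the reference distribution.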
