Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning
Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen

TL;DR
The paper introduces Jackpot, a novel framework using Optimal Budget Rejection Sampling to align rollout models with evolving policies, improving stability and efficiency in reinforcement learning for large language models.
Contribution
It presents a new OBRS-based method for reducing distribution mismatch in decoupled RL training, with a unified training objective and efficient implementation.
Findings
Significantly improves training stability over importance-sampling methods.
Achieves performance comparable to on-policy RL with fewer update steps.
Theoretically guarantees closer distribution alignment under a fixed acceptance budget.
Abstract
Reinforcement learning (RL) for large language models (LLMs) remains expensive, particularly because the rollout is expensive. Decoupling rollout generation from policy optimization (e.g., leveraging a more efficient model for rollouts) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budget Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-k probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution…
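To make the idea of rejection sampling under a fixed acceptance budget concrete, here is a minimal sketch on a toy three-token vocabulary. It assumes the commonly used budgeted acceptance rule a(x) = min(1, p(x) / (c · q(x))), where q is the rollout distribution, p is the target policy, and c is tuned so that the expected acceptance rate equals the budget. The function names, the bisection solver, and the toy distributions are illustrative assumptions; this is not the paper's implementation, and it ignores the top-k estimation and batch-level bias correction mentioned in the abstract.

```python
# Hypothetical illustration of budgeted rejection sampling on a tiny vocabulary.
# Assumes every q_i > 0 and 0 < budget <= 1.

def acceptance_probs(p, q, c):
    """Per-token acceptance probability min(1, p_i / (c * q_i))."""
    return [min(1.0, pi / (c * qi)) for pi, qi in zip(p, q)]

def acceptance_rate(p, q, c):
    """Expected fraction of rollout samples that survive rejection."""
    return sum(qi * ai for qi, ai in zip(q, acceptance_probs(p, q, c)))

def solve_c(p, q, budget, iters=100):
    """Bisection on c: the acceptance rate is non-increasing in c."""
    lo = 1e-9                                   # rate is ~1 here
    hi = max(max(pi / qi for pi, qi in zip(p, q)), 1.0 / budget)  # rate <= budget
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if acceptance_rate(p, q, mid) > budget:
            lo = mid                            # accepting too much: raise c
        else:
            hi = mid
    return hi

def post_rejection_distribution(p, q, c):
    """Distribution of accepted samples: q_i * a_i, renormalized."""
    weights = [qi * ai for qi, ai in zip(q, acceptance_probs(p, q, c))]
    total = sum(weights)
    return [w / total for w in weights]

def total_variation(a, b):
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))

# Toy example: the rollout model q is badly mismatched with the policy p.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
c = solve_c(p, q, budget=0.5)
adjusted = post_rejection_distribution(p, q, c)
print(total_variation(q, p), total_variation(adjusted, p))
```

In this toy case, keeping half of the rollout samples moves the accepted-sample distribution from a total-variation distance of 0.5 to roughly 0.3 from the target, which illustrates (but does not prove) the abstract's claim that OBRS moves the rollout distribution closer to the target.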
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Domain Adaptation and Few-Shot Learning
