Beyond Importance Sampling: Rejection-Gated Policy Optimization
Ziwu Sun, Zhen Gao, Jiyong Zhang, Jiaheng Li

TL;DR
This paper introduces Rejection-Gated Policy Optimization (RGPO), a new method that selectively trusts samples for policy updates, improving stability and performance in reinforcement learning.
Contribution
RGPO replaces importance sampling ratios with a differentiable acceptance gate, unifies existing policy gradient methods, and guarantees bounded variance and bias.
Findings
RGPO guarantees finite, bounded gradient variance even when the importance ratios are heavy-tailed.
RGPO incurs only bounded, controllable bias and offers an approximate monotonic policy improvement guarantee.
In experiments, RGPO outperforms PPO in reward and KL divergence metrics.
Abstract
We propose a new perspective on policy optimization: rather than reweighting all samples by their importance ratios, an optimizer should select which samples are trustworthy enough to drive a policy update. Building on this view, we introduce Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_theta = pi_theta / pi_old with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)) in the range [0, 1]. Unlike prior work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) * r. We prove that…
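The effective gradient weight in the abstract follows from the identity grad r_theta = r_theta * grad log pi_theta: differentiating g(r_theta) yields g'(r) * r * grad log pi_theta, so w(r) = g'(r) * r is the factor that multiplies each sample's score-function term. The sketch below evaluates w(r) for a few surrogates. The mapping of g(r) = r to the importance-sampling surrogate, clip(r, 1 - eps, 1 + eps) to PPO (ignoring the advantage-sign case split), and log r to REINFORCE is a standard reading rather than something quoted from the paper, and the saturating gate g(r) = r / (1 + r) is a purely hypothetical example of a smooth gate in [0, 1), not the gate RGPO uses. All function names are illustrative.

```python
def effective_weight(g_prime, r):
    """Return w(r) = g'(r) * r, the weight on grad log pi * advantage."""
    return g_prime(r) * r

def identity_grad(r):
    # g(r) = r: plain importance-sampling surrogate, so w(r) = r (unbounded).
    return 1.0

def clip_grad(r, eps=0.2):
    # g(r) = clip(r, 1 - eps, 1 + eps): simplified PPO-style surrogate.
    # The gradient is zero once r leaves the clip range.
    return 1.0 if 1.0 - eps < r < 1.0 + eps else 0.0

def log_grad(r):
    # g(r) = log r: score-function (REINFORCE-style) surrogate, so w(r) = 1.
    return 1.0 / r

def saturating_grad(r):
    # Hypothetical smooth gate g(r) = r / (1 + r), assumed for illustration only.
    # Its weight w(r) = r / (1 + r)^2 peaks at 1/4 and decays for large r.
    return 1.0 / (1.0 + r) ** 2

if __name__ == "__main__":
    surrogates = {
        "identity": identity_grad,
        "clip": clip_grad,
        "log": log_grad,
        "saturating (hypothetical)": saturating_grad,
    }
    for r in (0.5, 1.0, 2.0, 100.0):
        row = ", ".join(
            f"{name}: {effective_weight(gp, r):.4f}" for name, gp in surrogates.items()
        )
        print(f"r = {r:>6}: {row}")
```

With a heavy-tailed ratio such as r = 100, the identity surrogate weights the sample by 100, while the clipped and saturating choices drive its weight to zero or near zero. This is the sense in which the choice of g controls how far an off-policy sample can move the update.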
