Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning
Yu Luo, Shuo Han, Yihan Hu, Dong Li, Jianye Hao

TL;DR
This paper introduces R^2VPO, a new policy optimization method that stabilizes training and improves data efficiency in fine-tuning large language models by controlling the variance of policy ratios.
Contribution
It proposes a ratio-variance regularization framework for policy optimization, enabling stable on-policy learning and effective off-policy data reuse in LLM fine-tuning.
Findings
Achieves up to 17% performance gains over clipping-based methods.
Requires about 50% fewer rollouts to reach convergence.
Demonstrates improved stability and data efficiency in LLM fine-tuning.
Abstract
On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping stabilizes training, this heuristic hard constraint incurs a fundamental cost: it indiscriminately truncates gradients from high-return yet high-divergence actions, suppressing rare but highly informative "eureka moments" in complex reasoning. Moreover, once data becomes slightly stale, hard clipping renders it unusable, leading to severe sample inefficiency. In this work, we revisit the trust-region objective in policy optimization and show that explicitly constraining the variance (second central moment) of the policy ratio provides a principled and smooth relaxation of hard clipping. This distributional constraint stabilizes policy updates…
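To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It compares the standard PPO-style clipped surrogate with a surrogate that subtracts a penalty on the sample variance of the policy ratio. The penalty (Lagrangian) form, the function names, and the weight `lam` are assumptions made for illustration; the abstract states only that the variance of the ratio is constrained.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """Standard PPO-style clipped surrogate (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the elementwise minimum removes the incentive (and gradient)
    # to push the ratio outside [1 - eps, 1 + eps] in the favorable direction.
    return np.mean(np.minimum(unclipped, clipped))

def variance_penalized_surrogate(logp_new, logp_old, adv, lam=1.0):
    """Hypothetical soft alternative: importance-weighted advantage minus
    lam times the batch variance of the ratio. `lam` is an assumed
    penalty weight, not a value or name taken from the paper."""
    ratio = np.exp(logp_new - logp_old)
    # np.var computes the second central moment over the batch.
    return np.mean(ratio * adv) - lam * np.var(ratio)

# One high-return, high-divergence sample (ratio 3) among on-policy samples.
logp_old = np.zeros(4)
logp_new = np.log(np.array([1.0, 1.0, 3.0, 1.0]))
adv = np.array([0.0, 0.0, 1.0, 0.0])

clip_value = clipped_surrogate(logp_new, logp_old, adv)
soft_value = variance_penalized_surrogate(logp_new, logp_old, adv, lam=0.1)
```

In this toy batch the clipped objective caps the contribution of the high-ratio sample at `1 + eps`, so it is flat in that sample's ratio. The penalized objective still credits the full weighted advantage and charges for the spread of the ratios across the batch, which is one way to read the "smooth relaxation" described above.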
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
