Non-Asymptotic Global Convergence of PPO-Clip
Yin Liu, Qiming Dai, Junyu Zhang, Zaiwen Wen

TL;DR
This paper provides a rigorous theoretical analysis of the PPO-Clip algorithm, establishing non-asymptotic convergence rates and conditions for global optimality in reinforcement learning with policy regularization.
Contribution
It introduces a non-asymptotic convergence analysis of PPO-Clip under general RL settings with f-divergence regularization, including new smoothness and inequality conditions.
Findings
Proves linear convergence to the global optimum for the forward KL regularizer.
Establishes stationary convergence and local linear convergence for the reverse KL regularizer.
Provides theoretical foundations for PPO-Clip's empirical success.
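For readers unfamiliar with the two regularizers named above, the block below writes them as \(f\)-divergences against a reference policy. This is one common convention, stated here as an assumption; the symbols \(\pi_{\mathrm{ref}}\) and \(D_f\) are illustrative and the paper's own notation may differ.

```latex
% f-divergence between the policy \pi and a reference policy \pi_{ref} at state s
% (f convex with f(1) = 0; assumed convention, not taken from the paper)
D_f\bigl(\pi \,\|\, \pi_{\mathrm{ref}}\bigr)(s)
  = \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\,
    f\!\left(\frac{\pi(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}\right)

% Reverse KL: f(t) = t \log t
\mathrm{KL}\bigl(\pi \,\|\, \pi_{\mathrm{ref}}\bigr)(s)
  = \sum_{a} \pi(a \mid s)\,
    \log\frac{\pi(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}

% Forward KL: f(t) = -\log t
\mathrm{KL}\bigl(\pi_{\mathrm{ref}} \,\|\, \pi\bigr)(s)
  = \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\,
    \log\frac{\pi_{\mathrm{ref}}(a \mid s)}{\pi(a \mid s)}
```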
Abstract
Reinforcement learning (RL) has gained attention for aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF). The actor-only variants of Proximal Policy Optimization (PPO) are widely applied for their efficiency. These algorithms incorporate a clipping mechanism to improve stability. In addition, a regularization term, such as the reverse KL-divergence or a more general \(f\)-divergence, is introduced to prevent policy drift. Despite their empirical success, a rigorous theoretical understanding of the problem and the algorithm's properties remains limited. This paper advances the theoretical foundations of the PPO-Clip algorithm by analyzing a deterministic actor-only PPO algorithm within the general RL setting with \(f\)-divergence regularization under the softmax policy parameterization. We derive a non-uniform Lipschitz smoothness condition and a…
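To make the ingredients named in the abstract concrete, here is a minimal sketch of a clipped surrogate objective with a reverse KL penalty under a softmax policy. It is not the paper's algorithm: it assumes a single state (a bandit), exact advantages, and hypothetical names (`ppo_clip_objective`, `eps`, `beta`, `theta_ref`) chosen only for illustration.

```python
import numpy as np

def softmax(theta):
    """Softmax policy over actions from a vector of logits."""
    z = theta - np.max(theta)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def ppo_clip_objective(theta, theta_old, theta_ref, adv, eps=0.2, beta=0.1):
    """Clipped surrogate minus a reverse KL penalty (single-state sketch).

    theta, theta_old, theta_ref: logits of the current, previous, and
    reference policies; adv: per-action advantages under the previous policy.
    """
    pi, pi_old, pi_ref = softmax(theta), softmax(theta_old), softmax(theta_ref)
    ratio = pi / pi_old                              # importance ratio
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # clipping mechanism
    # Pessimistic (elementwise minimum) surrogate, averaged under pi_old.
    surrogate = np.sum(pi_old * np.minimum(ratio * adv, clipped * adv))
    # Reverse KL regularizer KL(pi || pi_ref) discourages policy drift.
    return float(surrogate - beta * kl(pi, pi_ref))

if __name__ == "__main__":
    adv = np.array([1.0, 0.0, -1.0])
    theta0 = np.zeros(3)
    print(ppo_clip_objective(np.array([0.5, 0.0, -0.5]), theta0, theta0, adv))
```

Because of the clip, pushing the logits far from `theta_old` cannot increase the surrogate term without bound, while the `beta`-weighted penalty grows as the policy moves away from the reference.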
Taxonomy
Topics: Reinforcement Learning in Robotics · Speech and dialogue systems · Machine Learning and Algorithms
