Rethinking the Trust Region in LLM Reinforcement Learning

Penghui Qi; Xiangxin Zhou; Zichen Liu; Tianyu Pang; Chao Du; Min Lin; Wee Sun Lee

arXiv:2602.04879·cs.LG·February 5, 2026

Open Access

TL;DR

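This paper argues that PPO's ratio clipping is ill-suited to the large vocabularies of LLMs and proposes Divergence Proximal Policy Optimization (DPPO), which constrains policy updates with an estimate of policy divergence instead of the sampled token's probability ratio, aiming to improve training stability and efficiency.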

Contribution

The paper introduces DPPO, a divergence-based constraint method for RL on LLMs that replaces PPO's heuristic ratio clipping with principled divergence estimates and efficient approximations.

Findings

01. DPPO trains more stably than PPO.

02. DPPO achieves higher training efficiency than PPO.

03. DPPO demonstrates robustness across various LLM fine-tuning tasks.

Abstract

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint…
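
To make the abstract's central claim concrete, here is a minimal Python sketch. It is not the paper's DPPO implementation. It only compares what a sampled-token probability ratio reports against the full-distribution divergence for two toy updates. The four-token vocabularies, the clip range `eps=0.2`, and the helper names `total_variation`, `kl_divergence`, `outside_clip_range`, and `summarize` are all illustrative assumptions, and the clip check ignores the advantage sign that real PPO also takes into account.

```python
import math


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def outside_clip_range(ratio, eps=0.2):
    """True if the ratio falls outside [1 - eps, 1 + eps] (assumed eps)."""
    return ratio < 1.0 - eps or ratio > 1.0 + eps


def summarize(old, new, token=0):
    """Return (sampled-token ratio, total variation, KL) for one update."""
    ratio = new[token] / old[token]
    return ratio, total_variation(old, new), kl_divergence(old, new)


# Case A (hypothetical): a rare token's probability triples, yet the
# distribution as a whole barely moves.
old_a = [0.001, 0.499, 0.300, 0.200]
new_a = [0.003, 0.497, 0.300, 0.200]

# Case B (hypothetical): the dominant token loses 0.1 probability mass,
# a much larger shift of the distribution.
old_b = [0.900, 0.050, 0.030, 0.020]
new_b = [0.800, 0.150, 0.030, 0.020]

for name, old, new in [("A: rare token", old_a, new_a),
                       ("B: dominant token", old_b, new_b)]:
    ratio, tv, kl = summarize(old, new)
    print(f"{name}: ratio={ratio:.3f} "
          f"outside_clip={outside_clip_range(ratio)} "
          f"TV={tv:.4f} KL={kl:.5f}")
```

In this toy setup the rare-token update in case A falls far outside the clip range although the two distributions are nearly identical, while the dominant-token update in case B stays inside the clip range despite a much larger divergence. That is the mismatch the abstract describes: over-penalized low-probability tokens and under-constrained high-probability tokens.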

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education