Stabilizing Policy Optimization via Logits Convexity
Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao

TL;DR
This paper identifies the convexity of the supervised fine-tuning (SFT) loss with respect to model logits as a key reason SFT trains more stably than reinforcement learning (RL), and introduces Logits Convex Optimization (LCO), a policy optimization framework motivated by that property to improve RL stability and performance.
Contribution
The paper reveals the importance of logits-level convexity for stable RL training and proposes LCO, a novel framework that improves stability and performance across various benchmarks.
Findings
LCO consistently improves training stability over traditional RL methods.
Logits convexity is crucial for favorable gradient directions during optimization.
Experiments show LCO outperforms PPO and other baselines on multiple benchmarks.
Abstract
While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target…
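The convexity claim in the abstract can be checked numerically on a toy example. The sketch below is not the paper's code or the LCO algorithm. It relies only on the standard fact that the Hessian of softmax cross-entropy with respect to the logits is diag(p) - p p^T, which is positive semidefinite. As a contrast, it also evaluates the negative target probability, used here as a simplified stand-in for an unclipped probability-ratio term with positive advantage (an assumption made for illustration, not the paper's analysis of PPO). All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """SFT-style loss: negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]


def ce_hessian(logits):
    """Hessian of cross-entropy w.r.t. the logits: diag(p) - p p^T."""
    p = softmax(logits)
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]


def quadratic_form(h, v):
    """v^T H v; non-negative for every v when H is positive semidefinite."""
    n = len(v)
    return sum(v[i] * h[i][j] * v[j] for i in range(n) for j in range(n))


def neg_target_prob(logits, target):
    """Illustrative stand-in (assumption): -pi(target), i.e. an unclipped
    ratio term with positive advantage, up to a positive constant."""
    return -softmax(logits)[target]


def midpoint_gap(f, a, b):
    """f(midpoint) - average of f(a), f(b); a convex f gives a value <= 0."""
    mid = [(x + y) / 2.0 for x, y in zip(a, b)]
    return f(mid) - 0.5 * (f(a) + f(b))


if __name__ == "__main__":
    a, b = [-4.0, 0.0], [0.0, 0.0]
    # Cross-entropy satisfies the midpoint convexity inequality.
    print("CE gap:", midpoint_gap(lambda z: cross_entropy(z, 0), a, b))
    # The probability-based objective violates it on this pair of points.
    print("-pi gap:", midpoint_gap(lambda z: neg_target_prob(z, 0), a, b))
```

On this pair of logit vectors the cross-entropy gap is non-positive while the probability-based gap is positive, so the latter is not convex in the logits. This is only a two-class sanity check and says nothing about the clipped objective as a whole or about the paper's experimental results.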
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
