Stabilizing Policy Optimization via Logits Convexity

Hongzhan Chen; Tao Yang; Yuhua Zhu; Shiping Gao; Xiaojun Quan; Ting Yao

arXiv:2603.00963·cs.LG·March 3, 2026

TL;DR

This paper identifies the convexity of the supervised fine-tuning (SFT) loss with respect to model logits as a key reason SFT trains more stably than reinforcement learning (RL), and introduces Logits Convex Optimization (LCO), a policy optimization framework that improves RL training stability and performance.

Contribution

The paper shows that logits-level convexity matters for stable RL training and proposes LCO, a framework that improves stability and performance across multiple benchmarks.

Findings

01. LCO consistently improves training stability over traditional RL methods.

02. Logits convexity is crucial for favorable gradient directions during optimization.

03. Experiments show LCO outperforms PPO and other baselines on multiple benchmarks.

Abstract

While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target…
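
The convexity property the abstract refers to can be checked numerically on a toy example. The sketch below is not the paper's LCO method. It only illustrates the standard fact that the cross-entropy (SFT) loss for a single target token is convex in the logits z: its Hessian is diag(p) - p p^T with p = softmax(z), so the quadratic form v^T H v equals the variance of v under p and is never negative. The vocabulary size, sampling ranges, and function names are hypothetical choices made for illustration.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(z, y):
    """Cross-entropy (SFT) loss for target index y, as a function of logits z."""
    return -math.log(softmax(z)[y])


def hessian_quadratic_form(z, v):
    """v^T H v with H = diag(p) - p p^T, the Hessian of the loss in z.

    This equals the variance of v under p, so it is never negative.
    """
    p = softmax(z)
    mean = sum(pi * vi for pi, vi in zip(p, v))
    return sum(pi * (vi - mean) ** 2 for pi, vi in zip(p, v))


def midpoint_gap(z1, z2, y):
    """Average of the endpoint losses minus the midpoint loss (>= 0 if convex)."""
    mid = [(a + b) / 2 for a, b in zip(z1, z2)]
    return 0.5 * (sft_loss(z1, y) + sft_loss(z2, y)) - sft_loss(mid, y)


random.seed(0)
vocab_size = 5  # hypothetical toy vocabulary, not from the paper
gaps, forms = [], []
for _ in range(1000):
    z1 = [random.gauss(0, 3) for _ in range(vocab_size)]
    z2 = [random.gauss(0, 3) for _ in range(vocab_size)]
    v = [random.gauss(0, 1) for _ in range(vocab_size)]
    y = random.randrange(vocab_size)
    gaps.append(midpoint_gap(z1, z2, y))
    forms.append(hessian_quadratic_form(z1, v))

# Both checks should hold (up to floating-point rounding) if the loss is convex in z.
print(min(gaps) >= -1e-12, min(forms) >= 0.0)
```

The check covers only the SFT side of the comparison. It says nothing about the PPO clipped surrogate or about how LCO constructs its target, which the truncated abstract does not specify.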

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques