OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning
Zixun Huang, Jiayi Sheng, Zeyu Zheng

TL;DR
This paper introduces a unified theoretical framework for reinforcement learning that characterizes policy-gradient estimators, enabling principled improvements in training stability and performance for large language models.
Contribution
It provides a systematic theoretical analysis of policy-gradient estimators, deriving variance expressions, convergence guarantees, and an adaptive learning-rate schedule, leading to the OBLR-PO algorithm.
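For orientation, the quantities named above can be written in the standard textbook form of a score-function (REINFORCE-style) estimator. This is a sketch under assumed notation, not the paper's own derivation: the policy $\pi_\theta$, reward $r$, baseline $b$, and sample size $N$ are illustrative symbols, and $b$ is assumed not to depend on the sampled response $y$.

```latex
% Assumed notation (not taken from the paper): y ~ pi_theta is a sampled response,
% r(y) its reward, b a baseline that does not depend on y.
\begin{align}
  J(\theta) &= \mathbb{E}_{y \sim \pi_\theta}\bigl[r(y)\bigr], \\
  \hat{g}_N &= \frac{1}{N} \sum_{i=1}^{N} \bigl(r(y_i) - b\bigr)\,
               \nabla_\theta \log \pi_\theta(y_i),
               \qquad y_i \overset{\text{i.i.d.}}{\sim} \pi_\theta, \\
  \mathbb{E}\bigl[\hat{g}_N\bigr] &= \nabla_\theta J(\theta)
     \quad \text{since} \quad
     \mathbb{E}_{y \sim \pi_\theta}\bigl[\nabla_\theta \log \pi_\theta(y)\bigr]
     = \nabla_\theta \sum_{y} \pi_\theta(y) = 0, \\
  \mathrm{SNR}(\hat{g}_N) &=
     \frac{\bigl\lVert \nabla_\theta J(\theta) \bigr\rVert^{2}}
          {\operatorname{tr}\,\mathrm{Cov}\bigl(\hat{g}_N\bigr)} .
\end{align}
```

The baseline leaves the expectation unchanged but does change the covariance term, which is why the choice of $b$ affects the SNR.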
Findings
OBLR-PO improves training stability and performance.
The theoretical analysis yields an adaptive learning-rate schedule based on the signal-to-noise ratio (SNR).
Experimental results show consistent gains on large language models.
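The idea behind these findings can be illustrated on a toy problem. The sketch below is not the OBLR-PO algorithm: it assumes a three-armed softmax bandit with Gaussian reward noise, estimates the score-function gradient with and without a baseline, and computes a plug-in SNR. The function `snr_scaled_lr` is a hypothetical schedule chosen only to show how a step size could shrink when noise dominates; the paper's actual schedule is not reproduced here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_function_grads(logits, reward_means, n_samples, baseline, rng):
    """Per-sample gradients (r - baseline) * grad log pi(a) for a softmax bandit."""
    probs = softmax(logits)
    arms = range(len(probs))
    grads = []
    for _ in range(n_samples):
        a = rng.choices(arms, weights=probs)[0]
        r = reward_means[a] + rng.gauss(0.0, 1.0)  # assumed unit-variance noise
        # Gradient of log softmax w.r.t. the logits is one_hot(a) - probs.
        grads.append([(r - baseline) * ((1.0 if i == a else 0.0) - p)
                      for i, p in enumerate(probs)])
    return grads


def mean_and_total_variance(grads):
    """Sample mean vector and the trace of the sample covariance."""
    n, d = len(grads), len(grads[0])
    mean = [sum(g[i] for g in grads) / n for i in range(d)]
    var = sum((g[i] - mean[i]) ** 2 for g in grads for i in range(d)) / (n - 1)
    return mean, var


def snr(grads):
    """Plug-in estimate: squared norm of the mean over per-sample total variance."""
    mean, var = mean_and_total_variance(grads)
    return sum(m * m for m in mean) / var


def snr_scaled_lr(base_lr, snr_value):
    """Hypothetical schedule (an assumption): base_lr * snr / (1 + snr)."""
    return base_lr * snr_value / (1.0 + snr_value)


if __name__ == "__main__":
    rng = random.Random(0)
    logits, means = [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]
    for b in (0.0, 2.0):  # 2.0 is the expected reward under the uniform policy
        g = score_function_grads(logits, means, 20000, b, rng)
        s = snr(g)
        print(f"baseline={b}: snr={s:.3f}, lr={snr_scaled_lr(0.1, s):.4f}")
```

In this toy setting both baselines give the same expected gradient, but the baseline equal to the expected reward has lower variance and therefore a higher SNR, so the illustrative schedule permits a larger step.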
Abstract
Existing reinforcement learning (RL)-based post-training methods for large language models have advanced rapidly, yet their design has largely been guided by heuristics rather than systematic theoretical principles. This gap limits our understanding of the properties of the gradient estimators and the associated optimization algorithms, thereby constraining opportunities to improve training stability and overall performance. In this work, we provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators under mild assumptions. Our analysis establishes unbiasedness, derives exact variance expressions, and yields an optimization-loss upper bound that enables principled reasoning about learning dynamics. Building on these results, we prove convergence guarantees and derive an adaptive learning-rate schedule governed by the…
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Topic Modeling
