OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning

Zixun Huang; Jiayi Sheng; Zeyu Zheng

arXiv:2511.23310 · stat.ML · January 16, 2026

TL;DR

This paper introduces a unified theoretical framework for reinforcement learning that characterizes policy-gradient estimators, enabling principled improvements in training stability and performance for large language models.

Contribution

It provides a systematic theoretical analysis of policy-gradient estimators, deriving variance expressions, convergence guarantees, and an adaptive learning-rate schedule, leading to the OBLR-PO algorithm.

Findings

01. OBLR-PO improves training stability and performance.

02. The theoretical analysis guides an adaptive learning-rate schedule based on the signal-to-noise ratio (SNR).

03. Experimental results show consistent gains on large language models.
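The idea behind the second finding, a step size that follows the SNR, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the paper's actual schedule is not reproduced on this page, and the helper names `gradient_snr` and `snr_scaled_lr`, the SNR definition (squared sample mean over the variance of the mean), and the rule `base_lr * snr / (1 + snr)` are hypothetical.

```python
import random
import statistics


def gradient_snr(samples):
    """Squared mean of the samples divided by the variance of their mean."""
    mean = statistics.fmean(samples)
    var_of_mean = statistics.variance(samples) / len(samples)
    if var_of_mean == 0.0:
        return float("inf")  # no observed noise at all
    return mean * mean / var_of_mean


def snr_scaled_lr(base_lr, snr):
    """Hypothetical rule (not from the paper): base_lr * snr / (1 + snr)."""
    if snr == float("inf"):
        return base_lr
    return base_lr * snr / (1.0 + snr)


# Toy usage: noisy scalar "gradients" scattered around a true value of 0.5.
rng = random.Random(0)
noisy_grads = [0.5 + rng.gauss(0.0, 2.0) for _ in range(64)]
snr = gradient_snr(noisy_grads)
lr = snr_scaled_lr(1e-3, snr)
print(f"estimated SNR = {snr:.3f}, step size = {lr:.2e}")
```

Under this assumed rule the step size approaches `base_lr` when the signal dominates and shrinks toward zero when the noise dominates, which is the qualitative behavior an SNR-based schedule is meant to capture.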

Abstract

Existing reinforcement learning (RL)-based post-training methods for large language models have advanced rapidly, yet their design has largely been guided by heuristics rather than systematic theoretical principles. This gap limits our understanding of the properties of the gradient estimators and the associated optimization algorithms, thereby constraining opportunities to improve training stability and overall performance. In this work, we provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators under mild assumptions. Our analysis establishes unbiasedness, derives exact variance expressions, and yields an optimization-loss upper bound that enables principled reasoning about learning dynamics. Building on these results, we prove convergence guarantees and derive an adaptive learning-rate schedule governed by the…
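The unbiasedness and variance properties mentioned in the abstract can be checked by hand on a toy problem. The sketch below is not taken from the paper: it assumes a three-action bandit with a softmax policy, and it uses the expected reward as the baseline purely for illustration. Because the action space is tiny, the mean and total variance of the score-function estimator (r(a) - b) * grad log pi(a) are computed exactly by enumeration rather than by sampling.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_moments(theta, rewards, baseline):
    """Exact mean and total variance of (r(a) - baseline) * grad log pi(a)."""
    pi = softmax(theta)
    k = len(theta)
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
    samples = [
        [(rewards[a] - baseline) * ((1.0 if j == a else 0.0) - pi[j])
         for j in range(k)]
        for a in range(k)
    ]
    mean = [sum(pi[a] * samples[a][j] for a in range(k)) for j in range(k)]
    var = sum(
        pi[a] * sum((samples[a][j] - mean[j]) ** 2 for j in range(k))
        for a in range(k)
    )
    return mean, var


# Toy parameters and rewards (assumed values, chosen only for illustration).
theta = [0.2, -0.1, 0.5]
rewards = [1.0, 0.0, 2.0]
pi = softmax(theta)
value = sum(p * r for p, r in zip(pi, rewards))

# True gradient of the expected reward: pi(k) * (r(k) - value).
true_grad = [pi[k] * (rewards[k] - value) for k in range(len(theta))]

mean_no_baseline, var_no_baseline = estimator_moments(theta, rewards, 0.0)
mean_value_baseline, var_value_baseline = estimator_moments(theta, rewards, value)

print("true gradient:      ", true_grad)
print("mean, no baseline:  ", mean_no_baseline)
print("mean, with baseline:", mean_value_baseline)
print("total variance:", var_no_baseline, "vs", var_value_baseline)
```

Both estimator means coincide with the true gradient for any constant baseline, since the expected score is zero. The variance does depend on the baseline; in this particular instance the expected-reward baseline gives a smaller total variance than no baseline, though that baseline is not claimed to be optimal.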

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Topic Modeling