Logarithmic Regret for Online KL-Regularized Reinforcement Learning

Heyang Zhao; Chenlu Ye; Wei Xiong; Quanquan Gu; Tong Zhang

arXiv:2502.07460·cs.LG·March 12, 2026

TL;DR

This paper introduces a new online KL-regularized reinforcement learning algorithm with a proven logarithmic regret bound, advancing theoretical understanding of KL-regularization's benefits in decision-making tasks.

Contribution

The paper presents the first optimism-based KL-regularized online contextual bandit algorithm, together with a novel regret analysis, and extends the approach to reinforcement learning with similar guarantees.

Findings

01. Achieves a logarithmic regret bound of O(η log(N_R T) d_R)

02. Extends the analysis to reinforcement learning with similar regret guarantees

03. Leverages the benign optimization landscape induced by KL-regularization
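The "benign optimization landscape" in the last finding can be illustrated with a standard fact about KL-regularized objectives. For a single context, the objective E_π[r] − (1/η)·KL(π ‖ π_ref) has a closed-form maximizer π*(a) ∝ π_ref(a)·exp(η·r(a)), and the suboptimality of any other policy π equals KL(π ‖ π*)/η. The sketch below checks both identities numerically. It is not the paper's algorithm: the function names, the toy rewards, the reference policy, and the convention that η multiplies the reward are all assumptions made for illustration.

```python
import math

def gibbs_policy(ref_probs, rewards, eta):
    """Closed-form maximizer of E_pi[r] - (1/eta) * KL(pi || pi_ref) (assumed convention)."""
    weights = [q * math.exp(eta * r) for q, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """KL divergence between two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_objective(policy, ref_probs, rewards, eta):
    """KL-regularized value of a policy for one context."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - kl(policy, ref_probs) / eta

# Hypothetical toy problem: three actions, one context.
ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 0.5]
eta = 2.0

opt = gibbs_policy(ref, rewards, eta)
opt_value = kl_objective(opt, ref, rewards, eta)

# The optimal value equals (1/eta) * log of the partition function.
log_partition_value = math.log(
    sum(q * math.exp(eta * r) for q, r in zip(ref, rewards))
) / eta

# Any other policy loses exactly KL(pi || pi*) / eta in value.
other = [0.2, 0.5, 0.3]
gap = opt_value - kl_objective(other, ref, rewards, eta)
kl_gap = kl(other, opt) / eta
```

Because the gap to the optimum is exactly a KL divergence scaled by 1/η, the objective is strongly concave in the policy. This kind of curvature is what analyses of KL-regularized learning typically exploit.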

Abstract

Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of KL-regularized objectives in decision making \citep{xiong2024iterative,xie2024exploratory,zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm…
