Logarithmic Regret for Online KL-Regularized Reinforcement Learning
Heyang Zhao, Chenlu Ye, Wei Xiong, Quanquan Gu, Tong Zhang

TL;DR
This paper introduces an optimism-based algorithm for online KL-regularized contextual bandits with a proven logarithmic regret bound and extends it to reinforcement learning, clarifying theoretically why KL-regularization helps in decision-making tasks.
Contribution
The paper presents the first optimism-based KL-regularized online contextual bandit algorithm together with a novel regret analysis, and extends both to reinforcement learning with similar guarantees.
Findings
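Achieves a logarithmic regret bound of O(η log(N_R T) · d_R), where η is the KL-regularization parameter, N_R the cardinality of the reward function class, T the number of rounds, and d_R the complexity of the reward function class
Extends the analysis to reinforcement learning with similar regret guarantees
Leverages the benign optimization landscape induced by KL-regularization together with optimistic reward estimation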
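One way to see why the KL-regularized landscape is called benign: for a finite action set, the regularized objective has a closed-form maximizer. The sketch below is illustrative only and is not the paper's algorithm. It assumes the objective for a single context is E_π[r] − (1/η)·KL(π ‖ π_ref), so that a larger η means weaker regularization. The names `gibbs_policy`, `kl_regularized_value`, `ref_probs`, and `rewards` are hypothetical.

```python
import numpy as np

def gibbs_policy(ref_probs, rewards, eta):
    """Closed-form maximizer of E_pi[r] - (1/eta) * KL(pi || ref).

    Assumes a finite action set, strictly positive ref_probs, and the
    (1/eta) weighting on the KL term (an assumption of this sketch).
    """
    logits = np.log(ref_probs) + eta * rewards
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def kl_regularized_value(policy, ref_probs, rewards, eta):
    """Value of the KL-regularized objective for a strictly positive policy."""
    kl = np.sum(policy * (np.log(policy) - np.log(ref_probs)))
    return float(policy @ rewards - kl / eta)

# Hypothetical three-action example.
ref_probs = np.array([0.5, 0.3, 0.2])
rewards = np.array([1.0, 0.0, 0.5])
eta = 2.0

pi_star = gibbs_policy(ref_probs, rewards, eta)
print(pi_star, kl_regularized_value(pi_star, ref_probs, rewards, eta))
```

Under these assumptions the optimal value equals (1/η)·log Σ_a π_ref(a)·exp(η·r(a)), which gives a quick check on the implementation. In an optimism-based method, `rewards` would be replaced by an optimistic reward estimate; that step is not shown here.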
Abstract
Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of the KL-regularized objective in decision making \citep{xiong2024iterative, xie2024exploratory, zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm…
