Beyond Pessimism: Offline Learning in KL-regularized Games
Yuheng Zhang, Claire Chen, Nan Jiang

TL;DR
This paper introduces a novel offline learning algorithm for KL-regularized two-player zero-sum games that avoids pessimism, achieving faster statistical rates and providing practical policy optimization methods.
Contribution
It develops the first pessimism-free offline learning guarantee for KL-regularized games with a near-optimal sample complexity of 1/n.
Findings
Achieves a fast 1/n sample complexity bound.
Introduces a self-play policy optimization algorithm with theoretical guarantees.
Provides the first pessimism-free guarantee for KL-regularized game learning.
Abstract
We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized with respect to a fixed reference policy through KL regularization. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only slower statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields, to our knowledge, the first pessimism-free offline learning guarantee for KL-regularized games, with a fast 1/n sample complexity bound. We further propose an efficient self-play policy optimization algorithm that replaces exact equilibrium computation with iterative KL-regularized policy updates, and prove that…
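As a rough illustration of the two ingredients named in the abstract, the sketch below computes a KL-regularized best response in a small matrix game and runs a damped self-play iteration toward its fixed point. It is not the paper's algorithm: it assumes the payoff matrix is known exactly (no offline data or estimation), and the names `kl_best_response`, `self_play`, `RPS`, `REF`, the step size, and the iteration count are all hypothetical choices made for this example. The only standard fact used is the closed form of the KL-regularized best response, p(a) ∝ ref(a) · exp((A q)_a / beta).

```python
import math


def kl_best_response(payoff, opponent, ref, beta):
    """Maximizer of p^T A q - beta * KL(p || ref) over distributions p.

    Closed form: p(a) is proportional to ref(a) * exp((A q)_a / beta).
    """
    scores = [
        sum(row[j] * opponent[j] for j in range(len(opponent))) / beta
        for row in payoff
    ]
    shift = max(scores)  # subtract the max for numerical stability
    weights = [r * math.exp(s - shift) for r, s in zip(ref, scores)]
    total = sum(weights)
    return [w / total for w in weights]


def self_play(payoff, ref, beta, step=0.1, iters=500):
    """Damped self-play (an illustrative update, assumed for this sketch).

    Each iteration moves the current policy part of the way, in log space,
    toward its own KL-regularized best response.
    """
    policy = list(ref)
    for _ in range(iters):
        target = kl_best_response(payoff, policy, ref, beta)
        mixed = [p ** (1 - step) * t ** step for p, t in zip(policy, target)]
        total = sum(mixed)
        policy = [x / total for x in mixed]
    return policy


# Toy skew-symmetric payoff matrix: A[i][j] == -A[j][i].
RPS = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]
REF = [0.5, 0.3, 0.2]  # hypothetical reference policy

policy = self_play(RPS, REF, beta=1.0)
response = kl_best_response(RPS, policy, REF, beta=1.0)
# A policy that is its own best response is a symmetric equilibrium
# of the regularized game, so this residual measures how close we are.
residual = max(abs(p - b) for p, b in zip(policy, response))
print(policy, residual)
```

Because the payoff matrix is skew-symmetric, both players face the same problem, so a single policy playing against itself suffices in this toy setting. Larger `beta` keeps the result closer to `REF`.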
