TL;DR
This paper analyzes a regularized policy optimization method for two-player zero-sum games, providing theoretical convergence guarantees and demonstrating improved training efficiency in various board games.
Contribution
It offers new theoretical convergence guarantees and develops a practical reinforcement learning algorithm that outperforms existing methods in multiple two-player games.
Findings
The policy update rule comes with convergence guarantees in two theoretical settings: normal-form games and finite-length games.
The proposed algorithm learns more efficiently than existing methods.
Empirical validation on five board games shows improved training efficiency.
Abstract
Two-player games such as board games have long been used as traditional benchmarks for reinforcement learning. This work revisits a policy optimization method with reverse Kullback-Leibler regularization and entropy regularization and analyzes this combination in two-player zero-sum settings from theoretical and empirical perspectives. From a theoretical perspective, we investigate the stability of the policy update rule in two theoretical settings: game-theoretic normal-form games and finite-length games. We provide novel convergence guarantees and verify our theoretical results through numerical experiments on synthetic games. From an empirical perspective, we derive a practical model-free reinforcement learning algorithm based on the regularized policy optimization. We validate the training efficiency of our algorithm through comprehensive experiments on five board games: Animal…
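As a rough illustration of the kind of update the abstract describes, the sketch below applies a generic closed-form step for an objective with a KL penalty toward the previous policy plus an entropy bonus, in a normal-form zero-sum game. It is not the paper's KLENT algorithm or its exact update rule, which the abstract does not state. It assumes the reverse KL term is KL(π ‖ π_old); the names `regularized_update`, `self_play`, `kl_coef`, and `ent_coef`, the coefficient values, and the rock-paper-scissors payoff matrix are all hypothetical choices made for the example.

```python
import math

def regularized_update(policy, q_values, kl_coef, ent_coef):
    """Maximize <pi, q> - kl_coef * KL(pi || policy) + ent_coef * H(pi).

    Setting the gradient of the Lagrangian to zero gives
    log pi(a) = (q(a) + kl_coef * log policy(a)) / (kl_coef + ent_coef) + const.
    Illustrative only; assumes every entry of `policy` is strictly positive.
    """
    total = kl_coef + ent_coef
    logits = [(q + kl_coef * math.log(p)) / total
              for p, q in zip(policy, q_values)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    norm = sum(weights)
    return [w / norm for w in weights]

def self_play(payoff, row, col, steps=500, kl_coef=10.0, ent_coef=0.5):
    """Simultaneous updates for both players of a zero-sum matrix game.

    `payoff[i][j]` is the row player's payoff; the column player receives
    its negative. Coefficient values are hypothetical, not from the paper.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    for _ in range(steps):
        # Expected payoff of each action against the opponent's current policy.
        q_row = [sum(payoff[i][j] * col[j] for j in range(n_cols))
                 for i in range(n_rows)]
        q_col = [-sum(payoff[i][j] * row[i] for i in range(n_rows))
                 for j in range(n_cols)]
        row, col = (regularized_update(row, q_row, kl_coef, ent_coef),
                    regularized_update(col, q_col, kl_coef, ent_coef))
    return row, col

# Rock-paper-scissors payoff for the row player (synthetic example game).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
row, col = self_play(RPS, [0.7, 0.2, 0.1], [0.1, 0.3, 0.6])
print(row, col)
```

In this symmetric game the entropy-regularized equilibrium is the uniform policy, so both players should drift toward uniform play from the skewed starting points. A larger `kl_coef` keeps each step closer to the previous policy, while `ent_coef` pulls the policy toward uniform.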
Peer Reviews
Decision · Submitted to ICLR 2026
The paper is clearly written and reports convincing gains in learning efficiency. KLENT achieves higher win rates or faster training progress than heavy search-based algorithms (AlphaZero, Gumbel-AlphaZero) under the same computational budget. The environments and baselines are comprehensive, and hyperparameter and implementation details are given. A key strength is the algorithm’s simplicity relative to AlphaZero-style methods. By avoiding MCTS, the method is much easier to implement and
It is important to note that the paper’s value may lie in the empirical finding that this straightforward combination works remarkably well on complex board games. Demonstrating that “model-free RL (with proper regularization) can rival search-based methods in these games” is a useful result for the community, especially for those who cannot afford massive search-based training. However, the lack of algorithmic novelty means the paper’s contributions are primarily empirical and engineering-orien
* The policy update rule introduced in 4.1 is novel to the best of my knowledge and could be applied in more decision-making algorithms that use KL divergence regularization (possibly outside perfect information games).
* KLENT improves on strong baselines such as PPO and Gumbel AlphaZero (without rollouts at test time).
* The experimental section is detailed, including an ablation study and a test on a large game (19x19 Go).
* When combined with MCTS test-time search, KLENT c
The main algorithm is almost identical to [1] and [2], which are applicable to a broader class of games. The main differences from those algorithms seem to be the new policy-update rule and that KLENT does not use the regularization policy update from [2]. However, the paper neither explains the relationship in detail nor provides a direct empirical comparison. Compared to [2], which is off-policy, KLENT works only on-policy. However, the extension to off-policy KLENT seems plausible. The str
- KLENT shows state-of-the-art performance on the board game benchmarks, outperforming AlphaZero and classic DRL algorithms.
- The proposed KLENT algorithm is relatively simple, which allows for further improvements and application of its key ideas in future research.
- A solid study of hyperparameter sensitivity for the proposed algorithm.
- The ablation study is not conducted well enough. While it is present in the paper, there is no direct comparison between KLENT with ablated features and KLENT without ablation.
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Advanced Bandit Algorithms Research
