Revisiting Regularized Policy Optimization for Stable and Efficient Reinforcement Learning in Two-Player Games

Kazuki Ota; Takayuki Osa; Motoki Omura; Tatsuya Harada

arXiv:2602.10894·cs.LG·May 22, 2026

TL;DR

This paper analyzes policy optimization with reverse Kullback-Leibler and entropy regularization in two-player zero-sum games, provides theoretical convergence guarantees, and demonstrates improved training efficiency on five board games.

Contribution

The paper offers novel theoretical convergence guarantees and derives a practical model-free reinforcement learning algorithm that outperforms existing methods in multiple two-player games.

Findings

01

The policy update rule is stable, with convergence guarantees in two theoretical settings: normal-form games and finite-length games.

02

The proposed algorithm learns more efficiently than existing methods.

03

Empirical validation on five board games shows improved training efficiency.

Abstract

Two-player games such as board games have long been used as traditional benchmarks for reinforcement learning. This work revisits a policy optimization method with reverse Kullback-Leibler regularization and entropy regularization and analyzes this combination in two-player zero-sum settings from theoretical and empirical perspectives. From a theoretical perspective, we investigate the stability of the policy update rule in two theoretical settings: game-theoretic normal-form games and finite-length games. We provide novel convergence guarantees and verify our theoretical results through numerical experiments on synthetic games. From an empirical perspective, we derive a practical model-free reinforcement learning algorithm based on the regularized policy optimization. We validate the training efficiency of our algorithm through comprehensive experiments on five board games: Animal…
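The abstract names the ingredients of the update (a reverse KL penalty toward the current policy plus an entropy bonus) but this excerpt does not state the rule itself. As a minimal sketch, assuming the per-state objective is ⟨π, q⟩ − α·KL(π ‖ π_old) + β·H(π), the maximizer over the simplex has the standard closed form π ∝ π_old^{α/(α+β)} · exp(q/(α+β)). The names `regularized_update`, `objective`, `self_play`, `alpha`, and `beta` are hypothetical and not taken from the paper, and the paper's actual rule may differ.

```python
import numpy as np


def regularized_update(prior, q, alpha, beta):
    """Closed-form maximizer over the simplex of
    <pi, q> - alpha * KL(pi || prior) + beta * H(pi).
    Assumes a strictly positive prior and alpha + beta > 0."""
    prior = np.asarray(prior, dtype=float)
    q = np.asarray(q, dtype=float)
    logits = (alpha * np.log(prior) + q) / (alpha + beta)
    logits -= logits.max()  # stabilize the exponential
    weights = np.exp(logits)
    return weights / weights.sum()


def objective(pi, prior, q, alpha, beta):
    """Value of the assumed regularized objective at a strictly positive pi."""
    pi = np.asarray(pi, dtype=float)
    prior = np.asarray(prior, dtype=float)
    kl = np.sum(pi * (np.log(pi) - np.log(prior)))
    entropy = -np.sum(pi * np.log(pi))
    return float(pi @ np.asarray(q, dtype=float) - alpha * kl + beta * entropy)


def self_play(payoff, x, y, alpha, beta, steps):
    """Simultaneous updates in a zero-sum matrix game.
    `payoff` holds the row player's payoffs; the column player receives the negative."""
    payoff = np.asarray(payoff, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    for _ in range(steps):
        q_row = payoff @ y        # expected payoff of each row action
        q_col = -payoff.T @ x     # expected payoff of each column action
        x, y = (regularized_update(x, q_row, alpha, beta),
                regularized_update(y, q_col, alpha, beta))
    return x, y
```

Calling `self_play` on a small matrix game such as rock-paper-scissors is one way to observe how the iterates behave for different `alpha` and `beta`. The stability and convergence claims in the abstract concern the paper's own rule and settings; this sketch carries no such guarantee.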

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper is clearly written and reports convincing gains in learning efficiency. KLENT achieves higher win rates or faster training progress than heavy search-based algorithms (AlphaZero, Gumbel-AlphaZero) under the same computational budget. The environments and baselines are comprehensive, and hyperparameter and implementation details are given. A key strength is the algorithm’s simplicity relative to AlphaZero-style methods. By avoiding MCTS, the method is much easier to implement and

Weaknesses

It is important to note that the paper’s value may lie in the empirical finding that this straightforward combination works remarkably well on complex board games. Demonstrating that “model-free RL (with proper regularization) can rival search-based methods in these games” is a useful result for the community, especially for those who cannot afford massive search-based training. However, the lack of algorithmic novelty means the paper’s contributions are primarily empirical and engineering-orien

Reviewer 02 · Rating 4 · Confidence 4

Strengths

* The policy update rule introduced in 4.1 is novel to the best of my knowledge and could be applied in more decision-making algorithms that use KL divergence regularization (possibly outside perfect information games).
* KLENT's performance is an improvement over strong baselines like PPO and Gumbel AlphaZero (without rollouts at test time).
* A detailed experimental section, including an ablation study and a test on a large game (19x19 Go).
* When combined with MCTS test-time search, KLENT c

Weaknesses

The main algorithm is almost identical to [1] and [2], which are applicable to a broader class of games. The main differences from those algorithms seem to be the new policy-update rule and that KLENT does not use the regularization policy update from [2]. However, the paper neither explains the relationship in detail nor provides a direct empirical comparison. Compared to [2], which is off-policy, KLENT works only on-policy. However, the extension to off-policy KLENT seems plausible. The str

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- KLENT shows state-of-the-art performance on the board game benchmarks, outperforming AlphaZero and classic DRL algorithms.
- The proposed KLENT algorithm is relatively simple, which allows for further improvements and application of its key ideas in future research.
- A solid study of hyperparameter sensitivity for the proposed algorithm.

Weaknesses

- The ablation study is not conducted well enough. While it is present in the paper, there is no direct comparison of KLENT with ablated features against KLENT without ablation.

Taxonomy

Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Advanced Bandit Algorithms Research