Monte-Carlo Tree Search as Regularized Policy Optimization
Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Rémi Munos

TL;DR
This paper reveals that AlphaZero's heuristics are approximations to a regularized policy optimization problem and introduces a variant that improves performance by solving this problem exactly.
Contribution
It provides a theoretical understanding of AlphaZero's heuristics and proposes an improved algorithm based on exact solutions to the regularized policy optimization problem.
Findings
The proposed variant outperforms AlphaZero in multiple domains.
AlphaZero's heuristics approximate a regularized policy optimization solution.
The new method offers more reliable and improved performance.
Abstract
The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.
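The kind of regularized policy optimization problem the abstract refers to can be illustrated with a small sketch. The abstract does not spell out the objective, so the form used here is an assumption: maximize sum_a q[a]*y[a] - lam * KL(prior || y) over probability vectors y, where q holds the search's action-value estimates, prior is the network policy, and lam is a regularization weight. The names regularized_policy, q, prior, and lam are hypothetical and chosen only for illustration. Under that assumption, and with a strictly positive prior, the stationarity condition gives y[a] = lam * prior[a] / (alpha - q[a]), and the normalizer alpha can be found by bisection.

```python
def regularized_policy(q, prior, lam, iters=60):
    """Sketch: maximize sum_a q[a]*y[a] - lam * KL(prior || y) over the simplex.

    Assumes prior[a] > 0 for every action and lam > 0.
    The optimum has the form y[a] = lam * prior[a] / (alpha - q[a]),
    where alpha is the Lagrange multiplier that makes y sum to 1.
    """
    # At `lo` the candidate policy sums to at least 1; at `hi` to at most 1.
    lo = max(qa + lam * pa for qa, pa in zip(q, prior))
    hi = max(q) + lam
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        total = sum(lam * pa / (alpha - qa) for qa, pa in zip(q, prior))
        if total > 1.0:
            lo = alpha  # mass too large: alpha must increase
        else:
            hi = alpha  # mass too small: alpha must decrease
    alpha = 0.5 * (lo + hi)
    y = [lam * pa / (alpha - qa) for qa, pa in zip(q, prior)]
    s = sum(y)
    return [v / s for v in y]  # renormalize away the residual bisection error


if __name__ == "__main__":
    # Small lam trusts the value estimates; large lam stays close to the prior.
    print(regularized_policy([1.0, 0.0], [0.5, 0.5], lam=0.01))
    print(regularized_policy([1.0, 0.0], [0.5, 0.5], lam=1000.0))
```

In this sketch a small lam concentrates the policy on the highest-valued action, while a large lam keeps it close to the prior, which is the trade-off a regularized objective of this shape encodes.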
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Machine Learning and Data Classification
Methods: AlphaZero · Monte-Carlo Tree Search
