TL;DR
This paper introduces a systematic method to derive prior-based UCTs from a broad class of UCBs, leading to variance-aware policies that outperform existing methods in Monte Carlo Tree Search benchmarks.
Contribution
The authors develop Inverse-RPO, a general approach to derive prior-based UCTs from various UCBs, and introduce variance-aware UCTs that improve performance without extra computational cost.
Findings
Variance-aware UCTs outperform PUCT in benchmarks.
Variance-aware UCTs require only minimal code changes.
Inverse-RPO provides a systematic derivation of prior-based UCTs.
Abstract
Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL) by integrating planning and learning in tasks requiring long-horizon reasoning, exemplified by the AlphaZero family of algorithms. Central to MCTS is the search strategy, governed by a tree policy based on an upper confidence bound (UCB) applied to trees (UCT). A key factor in the success of AlphaZero is the introduction of a prior term in the UCB1-based tree policy PUCT, which improves exploration efficiency and thus accelerates training. While many alternative UCBs with stronger theoretical guarantees than UCB1 exist, extending them to prior-based UCTs has been challenging, since PUCT was derived empirically rather than from first principles. Recent work retrospectively justified PUCT by framing MCTS as a regularized policy optimization (RPO) problem. Building on this perspective, we introduce…
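To make the contrast in the abstract concrete, the following is a minimal sketch of the two tree-policy scores it mentions: the classic UCB1 exploration bonus and the prior-weighted PUCT bonus in the form commonly given for AlphaZero. The function names, the default constants `c` and `c_puct`, and the flat-list representation of a node's children are illustrative assumptions, not taken from the paper. The paper's own variance-aware UCTs are not reproduced here, since the excerpt above does not state their formulas.

```python
import math


def ucb1_score(q, n_parent, n_child, c=math.sqrt(2)):
    """UCB1-style score: mean value plus a log-based exploration bonus.

    q        -- mean value estimate of the child (assumed scaled to [0, 1])
    n_parent -- visit count of the parent node
    n_child  -- visit count of the child
    c        -- exploration constant (hypothetical default)
    """
    if n_child == 0:
        # Unvisited children are tried first under UCB1.
        return math.inf
    return q + c * math.sqrt(math.log(n_parent) / n_child)


def puct_score(q, prior, n_parent, n_child, c_puct=1.25):
    """PUCT-style score as commonly written for AlphaZero.

    The exploration bonus is scaled by the policy prior, so actions the
    network considers unlikely receive less exploration.
    c_puct is a hypothetical default, not a value from the paper.
    """
    return q + c_puct * prior * math.sqrt(n_parent) / (1 + n_child)


def select_action(q_values, priors, visit_counts, c_puct=1.25):
    """Return the index of the child with the highest PUCT score."""
    n_parent = sum(visit_counts)
    scores = [
        puct_score(q, p, n_parent, n, c_puct)
        for q, p, n in zip(q_values, priors, visit_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal value estimates and visit counts, `select_action` favors the child with the larger prior, which is the effect the prior term is meant to have on exploration.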
