
TL;DR
This paper introduces a new Monte Carlo tree search method based on simple regret minimization, which empirically outperforms UCT by focusing on the final move selection rather than cumulative rewards.
Contribution
It proposes policies for multi-armed bandits that minimize simple regret, leading to a novel two-stage MCTS scheme and a VOI-aware sampling method that outperform existing algorithms.
Findings
The SR+CR scheme outperforms UCT empirically.
VOI-aware sampling improves search efficiency.
The new bandit policies achieve lower finite-time and asymptotic simple regret than UCB.
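As a rough illustration of the two-stage idea, the sketch below uses one sampling rule at the root of the search tree (aimed at simple regret) and UCB1 everywhere else (aimed at cumulative regret). The summary does not spell out the policies, so everything here is an assumption: the function names are hypothetical, and the root rule shown (pull the empirically best arm with probability 0.5, otherwise a uniformly random other arm) is just one plausible simple-regret policy, not necessarily the paper's exact choice.

```python
import math
import random


def ucb1_select(counts, means, c=math.sqrt(2)):
    """UCB1: try each arm once, then maximize mean + exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    return max(
        range(len(counts)),
        key=lambda a: means[a] + c * math.sqrt(math.log(total) / counts[a]),
    )


def half_greedy_select(counts, means, rng, greedy_prob=0.5):
    """Assumed root policy: best arm with probability greedy_prob,
    otherwise a uniformly random arm among the remaining ones."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    best = max(range(len(means)), key=lambda a: means[a])
    if len(means) == 1 or rng.random() < greedy_prob:
        return best
    others = [a for a in range(len(means)) if a != best]
    return rng.choice(others)


def sr_cr_select(counts, means, is_root, rng):
    """Two-stage selection: simple-regret rule at the root, UCB1 below it."""
    if is_root:
        return half_greedy_select(counts, means, rng)
    return ucb1_select(counts, means)
```

A call such as `sr_cr_select([10, 10], [0.2, 0.8], is_root=True, rng=random.Random(0))` samples a root action, while passing `is_root=False` falls back to plain UCB1. The design choice this illustrates is that only the root decision is actually played, so only the root needs a policy tuned for the quality of the final choice.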
Abstract
UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB, a sampling policy for the Multi-armed Bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" (the actual move selection) that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize the simple regret, as opposed to the cumulative regret. We begin by introducing policies for multi-armed bandits with lower finite-time and asymptotic simple regret than UCB, and use them to develop a two-stage scheme (SR+CR) for MCTS which outperforms UCT empirically. Optimizing the sampling process is itself a metareasoning problem, a solution of which can use value of information (VOI) techniques. Although the theory of VOI for search exists, applying it to…
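The distinction between the two regret notions in the abstract can be made concrete with a small sketch. The helper names and the arm means below are made up for illustration, and the simple regret is computed for a single realized final choice rather than as an expectation over runs.

```python
def cumulative_regret(true_means, pulls):
    """Sum of the gaps to the best arm over every pull made while sampling."""
    best = max(true_means)
    return sum(best - true_means[arm] for arm in pulls)


def simple_regret(true_means, chosen_arm):
    """Gap to the best arm for the single arm chosen after sampling ends."""
    return max(true_means) - true_means[chosen_arm]


# Hypothetical three-armed bandit; arm 2 is the best one.
true_means = [0.25, 0.5, 0.75]
pulls = [0, 1, 2, 2, 2]  # exploratory pulls made during the search
final_choice = 2         # the move actually played

print(cumulative_regret(true_means, pulls))     # 0.75
print(simple_regret(true_means, final_choice))  # 0.0
```

Here the exploratory pulls of arms 0 and 1 add to the cumulative regret, yet the simple regret is zero because the final choice is the best arm. This is the sense in which only the final "arm pull" matters in search.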
