
TL;DR
This paper introduces a sampling policy for Monte Carlo tree search (MCTS) based on value-of-information (VOI) estimates of rollouts. The policy targets the quality of the final move selection, and it is evaluated empirically on multi-armed bandit (MAB) problems and Computer Go.
Contribution
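The paper proposes a VOI-aware sampling policy for MCTS. UCT builds on UCB1, which minimizes cumulative regret over all arm pulls; the proposed policy instead chooses rollouts by their estimated value of information, because in search only the final move selection collects a reward.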
Findings
The VOI-based policy outperforms UCB1 and UCT on certain MAB instances.
The approach shows promising results in Computer Go scenarios.
The evaluation covers random MAB instances as well as Computer Go, with UCB1 and UCT as the baselines.
Abstract
UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB1, a sampling policy for the Multi-armed Bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" (the actual move selection) that collects a reward, rather than all "arm pulls". In this paper, an MCTS sampling policy based on Value of Information (VOI) estimates of rollouts is suggested. Empirical evaluation of the policy and comparison to UCB1 and UCT is performed on random MAB instances as well as on Computer Go.
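To make the abstract's distinction concrete, here is a minimal Python sketch of the UCB1 baseline on Bernoulli arms that reports both cumulative regret (summed over every pull) and simple regret (the loss of the single final choice). It does not implement the paper's VOI policy, whose estimates are not given on this page. The function names `ucb1_select` and `run_ucb1`, the Bernoulli reward model, and the arm probabilities are assumptions chosen for illustration.

```python
import math
import random


def ucb1_select(counts, means, total):
    """Return the arm index chosen by the standard UCB1 rule."""
    # Pull every arm once before applying the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Otherwise pick the arm with the largest upper confidence bound.
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(total) / counts[a]),
    )


def run_ucb1(probs, budget, seed=0):
    """Sample hypothetical Bernoulli arms with UCB1.

    Returns (cumulative_regret, simple_regret):
    cumulative regret charges every pull, simple regret only the
    final selection (the arm with the best sample mean).
    """
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k
    means = [0.0] * k
    best = max(probs)
    cumulative_regret = 0.0
    for t in range(budget):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
        cumulative_regret += best - probs[arm]
    final_arm = max(range(k), key=lambda a: means[a])
    simple_regret = best - probs[final_arm]
    return cumulative_regret, simple_regret


if __name__ == "__main__":
    cum, simple = run_ucb1([0.2, 0.5, 0.8], budget=500)
    print("cumulative regret:", cum)
    print("simple regret:", simple)
```

In a search setting only the second number matters, which is the motivation the abstract gives for replacing a cumulative-regret policy with one driven by the value of information of each rollout.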
Taxonomy
Topics: Sports Analytics and Performance · Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics
