TL;DR
This paper introduces a novel bandit algorithm, UCB1-Uniform, based on Extreme Value Theory, to improve Monte Carlo Tree Search in classical planning, with proven regret bounds and empirical validation.
Contribution
It applies Peaks-Over-Threshold Extreme Value Theory to narrow the assumed support of cost-to-go estimates and to justify max/min backups, proposing UCB1-Uniform with theoretical guarantees for classical planning.
Findings
UCB1-Uniform outperforms previous bandit algorithms in classical planning tasks.
Theoretical regret bounds are established for UCB1-Uniform.
Empirical results demonstrate improved planning performance using the new algorithm.
Abstract
Despite being successful in board games and reinforcement learning (RL), Monte Carlo Tree Search (MCTS) combined with Multi-Armed Bandits (MABs) has seen limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai 2024) showed that UCB1, designed for bounded rewards, does not perform well when applied to cost-to-go estimates in classical planning, which are unbounded in ℝ, and showed improved performance using a Gaussian reward MAB instead. This paper further sharpens our understanding of ideal bandits for planning tasks. Existing work has two issues: first, Gaussian MABs under-specify the support of cost-to-go estimates as (-∞, ∞), which we can narrow down. Second, Full Bellman backup (Schulte and Keller 2014), which backpropagates sample max/min, lacks theoretical justification. We use Peaks-Over-Threshold Extreme Value…
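For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of the standard UCB1 arm-selection rule, assuming rewards bounded in [0, 1]. It is not the paper's UCB1-Uniform; the function name `ucb1_select` and its parameters are hypothetical and chosen only for illustration. The fixed exploration term is what relies on the bounded-reward assumption, which is why applying it directly to unbounded cost-to-go estimates is problematic.

```python
import math


def ucb1_select(counts, means, c=math.sqrt(2)):
    """Pick an arm index using the standard UCB1 rule.

    counts[i] is the number of times arm i was pulled, means[i] is its
    empirical mean reward. Rewards are ASSUMED to lie in [0, 1]; the
    exploration bonus is only calibrated under that assumption.
    """
    # Pull every arm at least once before using the confidence bound.
    for i, n in enumerate(counts):
        if n == 0:
            return i

    total = sum(counts)
    # Score = empirical mean + c * sqrt(ln(total pulls) / pulls of this arm).
    # With c = sqrt(2) this is the classic sqrt(2 ln t / n_i) bonus.
    scores = [
        m + c * math.sqrt(math.log(total) / n)
        for m, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

In an MCTS setting, a rule of this kind would be evaluated at each tree node to choose which successor to descend into; a planner minimizing cost-to-go would negate the estimates or use the lower confidence bound instead.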