Power Mean Estimation in Stochastic Monte-Carlo Tree Search
Tuan Dam, Odalric-Ambrym Maillard, Emilie Kaufmann

TL;DR
This paper introduces Stochastic-Power-UCT, an MCTS algorithm using the power mean estimator for stochastic environments, providing theoretical convergence guarantees and empirical validation.
Contribution
It develops a new MCTS algorithm with the power mean estimator tailored for stochastic MDPs and proves its polynomial convergence rate.
Findings
Stochastic-Power-UCT achieves the same O(n^{-1/2}) convergence rate as Fixed-Depth-MCTS
Theoretical analysis confirms polynomial convergence in stochastic MDPs
Empirical tests validate the theoretical results across various environments
Abstract
Monte-Carlo Tree Search (MCTS) is a widely-used strategy for online planning that combines Monte-Carlo sampling with forward tree search. Its success relies on the Upper Confidence bound for Trees (UCT) algorithm, an extension of the UCB method for multi-arm bandits. However, the theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection, leading to the development of Fixed-Depth-MCTS with a polynomial exploration bonus to balance exploration and exploitation~\citep{shah2022journal}. Both UCT and Fixed-Depth-MCTS suffer from biased value estimation: the weighted sum underestimates the optimal value, while the maximum valuation overestimates it~\citep{coulom2006efficient}. The power mean estimator offers a balanced solution, lying between the average and maximum values. Power-UCT~\citep{dam2019generalized} incorporates this estimator…
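The claim that the power mean lies between the average and the maximum can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes non-negative child values, uses visit counts as weights, and the function name `power_mean` and the sample numbers are purely illustrative.

```python
def power_mean(values, weights, p):
    """Weighted power mean (sum_i w_i * x_i**p) ** (1/p), with weights normalized to sum to 1.

    Assumes non-negative values and p >= 1 (illustrative restrictions).
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    total = float(sum(weights))
    return sum((w / total) * v ** p for w, v in zip(weights, values)) ** (1.0 / p)


# Hypothetical child-node value estimates and visit counts.
child_values = [0.2, 0.5, 0.9]
visit_counts = [10, 30, 60]

# p = 1 recovers the visit-weighted average; larger p moves toward the maximum.
estimates = {p: power_mean(child_values, visit_counts, p) for p in (1, 2, 8, 64)}
for p, estimate in estimates.items():
    print(f"p={p:>2}: {estimate:.4f}")
```

With p = 1 the estimate equals the weighted average, and it increases toward max(child_values) as p grows, so p acts as a knob between the underestimating and overestimating backups described above.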
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
