POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with Non-Asymptotic Analysis
Weichao Mao, Kaiqing Zhang, Qiaomin Xie, Tamer Başar

TL;DR
This paper introduces POLY-HOOT, a Monte-Carlo planning algorithm for continuous state-action spaces that augments MCTS with HOO using a polynomial, rather than logarithmic, bonus term. It comes with non-asymptotic convergence guarantees and empirical validation.
Contribution
The paper presents POLY-HOOT, which integrates a polynomial bonus into HOO for planning in continuous-space MDPs, and supports it with theoretical regret bounds and convergence guarantees.
Findings
POLY-HOOT converges to an arbitrarily small neighborhood of the optimal value function at a polynomial rate.
The polynomial bonus improves empirical performance.
Theoretical regret bounds are established for non-stationary bandits.
Abstract
Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), has demonstrated remarkable performance in applications with finite spaces. In this paper, we consider Monte-Carlo planning in an environment with continuous state-action spaces, a much less understood problem with important applications in control and robotics. We introduce POLY-HOOT, an algorithm that augments MCTS with a continuous armed bandit strategy named Hierarchical Optimistic Optimization (HOO) (Bubeck et al., 2011). Specifically, we enhance HOO by using an appropriate polynomial, rather than logarithmic, bonus term in the upper confidence bounds. Such a polynomial bonus is motivated by its empirical successes in AlphaGo Zero (Silver et al., 2017b), as well as its significant role in achieving theoretical guarantees of finite space MCTS (Shah et al., 2019). We investigate, for the first time, the regret of…
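The difference between a logarithmic and a polynomial bonus term can be illustrated with a minimal sketch. The functional forms below are generic, and the constant `c` and the exponents `alpha` and `xi` are hypothetical placeholders chosen for illustration; they are not the values or the exact expression used in the paper.

```python
import math


def log_bonus(t, n, c=1.0):
    """UCB1-style logarithmic bonus: c * sqrt(2 * ln(t) / n).

    t: total number of rounds so far (t >= 1)
    n: number of times this arm (or tree node) has been visited
    """
    if n == 0:
        return math.inf  # unvisited arms are always explored first
    return c * math.sqrt(2.0 * math.log(t) / n)


def poly_bonus(t, n, c=1.0, alpha=0.25, xi=0.5):
    """Generic polynomial bonus: c * t**alpha / n**xi.

    alpha and xi are hypothetical exponents for illustration only.
    """
    if n == 0:
        return math.inf
    return c * t ** alpha / n ** xi


def ucb_index(mean_reward, t, n, bonus=poly_bonus):
    """Upper confidence index: empirical mean plus exploration bonus."""
    return mean_reward + bonus(t, n)


if __name__ == "__main__":
    # Keep the visit count fixed and let the total round count grow:
    # the polynomial bonus grows much faster than the logarithmic one.
    for t in (100, 10_000, 1_000_000):
        print(t, round(log_bonus(t, 10), 3), round(poly_bonus(t, 10), 3))
```

With the visit count held fixed, the polynomial bonus keeps growing at a polynomial rate in `t`, so rarely visited arms are revisited more aggressively than under the logarithmic bonus.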
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Artificial Intelligence in Games
Methods: Monte-Carlo Tree Search
