# Non-Asymptotic Analysis of Monte Carlo Tree Search

**Authors:** Devavrat Shah, Qiaomin Xie, Zhi Xu

arXiv: 1902.05213 · 2020-01-14

## TL;DR

This paper provides a non-asymptotic analysis of Monte Carlo Tree Search (MCTS) in reinforcement learning, establishing polynomial concentration of regret for non-stationary bandits and demonstrating near-optimal sample complexity for value function approximation.

## Contribution

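The paper proves polynomial (rather than exponential) concentration of regret for a class of non-stationary multi-armed bandits (MABs). It then shows that MCTS with a polynomial UCB bonus, combined with nearest neighbor supervised learning, iteratively improves value estimates with near-optimal sample complexity.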

## Key findings

- Regret concentrates polynomially, not exponentially, for a class of non-stationary MABs.
- MCTS combined with nearest neighbor supervised learning achieves near-optimal sample complexity.
- Sample complexity scales as O(\u03b5^{-(d+4)}) for  approximation.

## Abstract

In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in the context of infinite-horizon discounted cost Markov Decision Process (MDP). While MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof in the seminal works is incomplete. This is due to the fact that the variant, the Upper Confidence Bound for Trees (UCT), analyzed in prior works utilizes "logarithmic" bonus term for balancing exploration and exploitation within the tree-based search, following the insights from stochastic multi-arm bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent non-stationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold as pointed out in literature, even for stationary MABs. As the key contribution of this work, we establish polynomial concentration property of regret for a class of non-stationary MAB. This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has the claimed property. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a "policy improvement" operator: it iteratively improves value function approximation for all states, due to combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an $\varepsilon$ approximation of the value function with respect to $\ell_\infty$ norm, MCTS combined with nearest neighbor requires a sample size scaling as $\widetilde{O}\big(\varepsilon^{-(d+4)}\big)$, where $d$ is the dimension of the state space. This is nearly optimal due to a minimax lower bound of $\widetilde{\Omega}\big(\varepsilon^{-(d+2)}\big)$.
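
The contrast between a logarithmic and a polynomial exploration bonus can be sketched in a few lines of Python. This is a minimal illustration, not the paper's algorithm: it uses a stationary Bernoulli bandit rather than the recursively dependent non-stationary bandits that arise inside MCTS, and the polynomial form `c * t**a / s**b` with its constants `c`, `a`, `b` is an assumed placeholder, not the exponents chosen in the paper. The function names are likewise hypothetical.

```python
import math
import random


def log_bonus(t, s):
    """UCB1-style logarithmic bonus sqrt(2 ln t / s)."""
    return math.sqrt(2.0 * math.log(t) / s)


def poly_bonus(t, s, c=1.0, a=0.25, b=0.5):
    """Polynomial bonus c * t**a / s**b.

    The constants are illustrative assumptions, not the paper's values.
    """
    return c * t ** a / s ** b


def run_ucb(means, horizon, bonus, seed=0):
    """Run an index policy on Bernoulli arms and return per-arm pull counts.

    t is the global step count and s is the number of pulls of the arm
    whose index is being computed.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once so each empirical mean exists
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i] + bonus(t, counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    arms = [0.2, 0.8]
    print("log bonus :", run_ucb(arms, 2000, log_bonus))
    print("poly bonus:", run_ucb(arms, 2000, poly_bonus))
```

Because the polynomial bonus grows like a power of `t` instead of `log t`, it keeps forcing pulls of under-sampled arms for longer, which is the kind of extra exploration the abstract argues is needed when rewards come from non-stationary subtrees.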

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.05213/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1902.05213/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/1902.05213/full.md

---
Source: https://tomesphere.com/paper/1902.05213