Measurable Monte Carlo Search Error Bounds
John Mern, Mykel J. Kochenderfer

TL;DR
This paper introduces computable bounds on the sub-optimality of Monte Carlo search estimates for non-stationary bandits and Markov decision processes, enabling confidence assessment without knowing true action-values.
Contribution
It provides the first practical bounds on Monte Carlo search error that can be computed after search without knowledge of the true action-values, applicable to general Monte Carlo solvers that meet mild convergence conditions.
Findings
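In empirical tests on a multi-armed bandit and a discrete Markov decision process, the bounds are tight for both a simple solver and Monte Carlo tree search.
The bounds apply to a wide class of Monte Carlo solvers, requiring only mild convergence conditions.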
Experimental validation shows the bounds effectively measure sub-optimality.
Abstract
Monte Carlo planners can often return sub-optimal actions, even if they are guaranteed to converge in the limit of infinite samples. Known asymptotic regret bounds do not provide any way to measure confidence of a recommended action at the conclusion of search. In this work, we prove bounds on the sub-optimality of Monte Carlo estimates for non-stationary bandits and Markov decision processes. These bounds can be directly computed at the conclusion of the search and do not require knowledge of the true action-value. The presented bound holds for general Monte Carlo solvers meeting mild convergence conditions. We empirically test the tightness of the bounds through experiments on a multi-armed bandit and a discrete Markov decision process for both a simple solver and Monte Carlo tree search.
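To make the abstract's opening claim concrete, the following is a minimal sketch of how a finite-sample Monte Carlo estimate can recommend a sub-optimal action. It is not the paper's bound. It assumes a stationary Bernoulli bandit and a uniform-sampling solver for simplicity, and the names `monte_carlo_recommend`, `suboptimality`, `error_rate`, and `TRUE_MEANS` are hypothetical. Note that `suboptimality` needs the true action-values, which is exactly the knowledge the paper's bounds are designed to do without.

```python
import random

def monte_carlo_recommend(true_means, samples_per_arm, rng):
    """Sample every arm uniformly and recommend the arm with the highest sample mean."""
    estimates = []
    for p in true_means:
        # Bernoulli rewards: 1.0 with probability p, else 0.0 (illustrative assumption).
        pulls = [1.0 if rng.random() < p else 0.0 for _ in range(samples_per_arm)]
        estimates.append(sum(pulls) / samples_per_arm)
    best = max(range(len(true_means)), key=lambda a: estimates[a])
    return best, estimates

def suboptimality(true_means, action):
    """True value gap of the recommended action; requires the true action-values."""
    return max(true_means) - true_means[action]

def error_rate(true_means, samples_per_arm, trials, seed=0):
    """Fraction of independent searches that recommend a sub-optimal arm."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        action, _ = monte_carlo_recommend(true_means, samples_per_arm, rng)
        if suboptimality(true_means, action) > 0:
            wrong += 1
    return wrong / trials

# Hypothetical three-armed bandit; arm 1 is optimal by a small margin.
TRUE_MEANS = [0.50, 0.55, 0.45]

few_samples = error_rate(TRUE_MEANS, samples_per_arm=10, trials=500)
many_samples = error_rate(TRUE_MEANS, samples_per_arm=500, trials=500)
print(f"error rate with 10 samples per arm:  {few_samples:.3f}")
print(f"error rate with 500 samples per arm: {many_samples:.3f}")
```

The error rate shrinks as the sample budget grows, but the searcher itself never observes it, because it has no access to `TRUE_MEANS`. A bound computed from search statistics alone would fill that gap.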
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
