C-MCTS: Safe Planning with Monte Carlo Tree Search
Dinesh Parthasarathy, Georgios Kontes, Axel Plinge, Christopher Mutschler

TL;DR
C-MCTS introduces a safety critic trained offline to guide Monte Carlo Tree Search in constrained decision-making, improving safety, reward, and efficiency in safety-critical tasks under model mismatch.
Contribution
It proposes Constrained MCTS (C-MCTS), which uses a safety critic to improve safety and efficiency in constrained planning, addressing the high variance of the Monte Carlo cost estimates used by previous methods.
Findings
C-MCTS satisfies cost constraints while achieving higher rewards.
It operates closer to the constraint boundary, improving reward.
It is more robust to model mismatch, reducing violations.
Abstract
The Constrained Markov Decision Process (CMDP) formulation makes it possible to solve safety-critical decision-making tasks that are subject to constraints. While CMDPs have been extensively studied in the Reinforcement Learning literature, little attention has been given to sampling-based planning algorithms such as MCTS for solving them. Previous approaches perform conservatively with respect to costs as they avoid constraint violations by using Monte Carlo cost estimates that suffer from high variance. We propose Constrained MCTS (C-MCTS), which estimates cost using a safety critic that is trained with Temporal Difference learning in an offline phase prior to agent deployment. The critic limits exploration by pruning unsafe trajectories within MCTS during deployment. C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards than previous work. As…
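The two ingredients named in the abstract, a cost critic learned offline with Temporal Difference updates and critic-based pruning during tree search, can be illustrated with a minimal sketch. This is not the authors' implementation: the tabular critic, the function names `td_update_cost_critic` and `select_action`, the `cost_budget` parameter, and the fallback to the least-cost action when every action looks unsafe are all assumptions made for illustration.

```python
import math


def td_update_cost_critic(q_cost, transitions, alpha=0.1, gamma=0.95):
    """One TD(0) pass over logged (s, a, cost, s_next, a_next, done) tuples.

    q_cost is a hypothetical tabular critic: dict mapping (state, action)
    to the estimated discounted cumulative cost.
    """
    for s, a, cost, s_next, a_next, done in transitions:
        # Bootstrap from the next state-action estimate unless the episode ended.
        bootstrap = 0.0 if done else gamma * q_cost.get((s_next, a_next), 0.0)
        target = cost + bootstrap
        old = q_cost.get((s, a), 0.0)
        q_cost[(s, a)] = old + alpha * (target - old)
    return q_cost


def select_action(state, actions, stats, q_cost, cost_budget, c_uct=1.4):
    """UCT selection restricted to actions the critic predicts to be safe.

    stats maps each action to {"n": visit count, "value": summed reward}.
    """
    # Prune actions whose predicted cost exceeds the budget.
    safe = [a for a in actions if q_cost.get((state, a), 0.0) <= cost_budget]
    # Assumption: if nothing is predicted safe, fall back to the least-cost action.
    candidates = safe or [min(actions, key=lambda a: q_cost.get((state, a), 0.0))]
    total = sum(stats[a]["n"] for a in candidates) + 1

    def uct(a):
        n = stats[a]["n"]
        if n == 0:
            return math.inf  # always try unvisited actions first
        return stats[a]["value"] / n + c_uct * math.sqrt(math.log(total) / n)

    return max(candidates, key=uct)
```

In this sketch the critic is consulted only at selection time, so a high-reward action is never explored once its predicted cost exceeds the budget. The paper's actual critic architecture and pruning rule may differ from this simplification.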
Peer Reviews
Decision · Submitted to ICLR 2024
- C-MCTS improves solution quality without violating the cost constraints.
- The actual running time of the experiments needs to be provided.
- Several points in the explanation need to be clarified, as described in the Questions below.
**Originality:** While not introducing a completely novel approach, the authors apply an offline learning technique to estimate costs in CMDPs online. **Significance:** The contributions of the paper lack significance. **Clarity:** The paper is understandable.
The main drawback of the paper is its lack of significance. The approach introduced is not novel. Learning values/cost estimates offline to be applied online is not a new idea. Nor is the learning approach using a novel technique. I also question the soundness of the analysis. In Prop. 1, the authors claim that at each iteration of their algorithm, they are guaranteed to find the optimal solution. They base their claim on the proof in [Kocsis & Szepesvari, 2006]. However, that work states that
Dealing with safety constraints is probably one of the weaknesses of RL approaches, and one of the main obstacles to applying RL and planning algorithms like MCTS in real-world scenarios. While in many of those scenarios the algorithms face continuous state/action spaces, tackling the issue in discrete spaces is also important. The extension of MCTS to constrained MDPs seems fairly reasonable.
While some theoretical work is included, it does not offer sufficient guarantees for practical applicability. The proposed algorithm could be a step towards a practical application, but it is not there yet. Given that this is a largely empirical article, the experimental evaluation is rather small. The benchmarks are small and fairly simple, and the set of baselines is also limited.
Taxonomy
Topics: Software Reliability and Analysis Research · Adversarial Robustness in Machine Learning · AI-based Problem Solving and Planning
Methods: Pruning
