TL;DR
This paper introduces Epistemic MCTS, a method that incorporates epistemic uncertainty into Monte Carlo Tree Search, improving exploration and sample efficiency in challenging sparse-reward tasks.
Contribution
The paper presents a theoretically motivated approach to account for epistemic uncertainty in MCTS, enhancing deep exploration in complex environments.
Findings
AZ paired with EMCTS achieves higher sample efficiency than baseline AZ in SUBLEQ code-writing tasks.
EMCTS solves Deep Sea benchmarks faster than baseline methods.
Incorporating epistemic uncertainty improves exploration in sparse reward settings.
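For readers unfamiliar with Deep Sea, the following is a minimal sketch of the benchmark as it is commonly described, not the exact variant evaluated in the paper. The class name `DeepSea`, the `move_cost` default, and the fixed left/right action mapping are illustrative assumptions. The agent starts in the top-left cell of an N×N grid, descends one row per step, and is rewarded only if it moves right on every step, so a uniformly random policy succeeds with probability 2^-N.

```python
from dataclasses import dataclass


@dataclass
class DeepSea:
    """Simplified Deep Sea grid (illustrative; parameters are assumptions)."""
    size: int = 10
    move_cost: float = 0.01  # total cost of moving right on every step (assumed)
    row: int = 0
    col: int = 0

    def reset(self):
        self.row, self.col = 0, 0
        return (self.row, self.col)

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (state, reward, done)."""
        reward = 0.0
        if action == 1:
            # Moving right carries a small cost, which discourages exploration.
            reward -= self.move_cost / self.size
            # The treasure is paid only for moving right from the bottom-right cell.
            if self.row == self.size - 1 and self.col == self.size - 1:
                reward += 1.0
            self.col = min(self.col + 1, self.size - 1)
        else:
            self.col = max(self.col - 1, 0)
        self.row += 1
        done = self.row == self.size
        return (self.row, self.col), reward, done


def run_episode(env, policy):
    """Roll out one episode; `policy` maps a (row, col) state to an action."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total
```

Because the only rewarding trajectory also pays the movement cost at every step, undirected exploration tends to drift away from it, which is why the task is used to test deep exploration.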
Abstract
The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse reward environments. However, MCTS does not account for the propagation of this uncertainty. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the assembly language SUBLEQ, AZ paired with our method achieves significantly higher sample efficiency than baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea - which baseline A/MZ are…
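The idea of propagating epistemic uncertainty through search and acting on an upper confidence bound can be illustrated with a small sketch. This is not the authors' algorithm: the function names, the independence assumption between reward and value uncertainty, the deterministic transition, and the `beta` and `gamma` parameters are all assumptions made for illustration.

```python
import math


def backup_step(reward_mean, reward_var, next_value_mean, next_value_var, gamma=0.99):
    """One-step backup of a value estimate and its epistemic variance.

    Assumes a deterministic transition and independent reward/value
    uncertainties, so the variances add (the value term scaled by gamma**2).
    """
    q_mean = reward_mean + gamma * next_value_mean
    q_var = reward_var + gamma ** 2 * next_value_var
    return q_mean, q_var


def epistemic_ucb(q_mean, q_var, beta=1.0):
    """Optimistic score: mean plus beta standard deviations."""
    return q_mean + beta * math.sqrt(q_var)


def select_action(children, beta=1.0):
    """Pick the action with the highest optimistic score.

    `children` maps each action to a (q_mean, q_var) pair.
    """
    return max(children, key=lambda a: epistemic_ucb(*children[a], beta=beta))
```

With `beta=0` the selection reduces to greedy choice on the mean; a larger `beta` steers the search toward actions whose outcomes the learned model is still unsure about.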
Peer Reviews
Decision·ICLR 2025 Poster
1. The new algorithm solves MCTS with a learned model, with deep exploration and variance propagation. 2. The model is supported by theoretical insights. 3. The model performs quite well on the baseline tasks.
1. The practical value of epistemic MCTS remains questionable. In what real-world tasks do we need to run search without knowing the world model?
1. The paper provides a relatively rigorous theoretical foundation for EMCTS. The authors derive an upper confidence bound for the epistemic uncertainty within the learned MCTS model, which theoretically justifies the exploration strategy of EMCTS. 2. The experiments demonstrate the effectiveness of the EMCTS algorithm in the SUBLEQ and Deep Sea tasks. EMCTS shows better sample efficiency than the baseline methods. Notably, EMCTS demonstrates deep exploration in the presence of the learned value, reward,
1. The authors assume consistency across the reward and value. This assumption might not hold in dynamic environments or with certain deep neural network structures in practice. To what degree do those assumptions hold in practice? Are there any experimental results showing that those assumptions hold (or nearly hold) in practice? How does EMCTS handle potential inconsistencies in epistemic uncertainty when the learned models (reward, value, transition) are not fully aligned? 2. The upper bound
1. This manuscript proposes a novel and solid framework to introduce and deal with the uncertainty of the learned model caused by limited data or sparse reward environments, which is not taken into account in standard MCTS. 2. EMCTS solves the two-fold objective, i.e., extending MCTS to estimate and propagate the epistemic uncertainty from the uncertain learned model and harnessing the epistemic uncertainty in the search to achieve deep exploration of the environment, via formulating the epistemic uncertai
1. Some assumptions might make EMCTS inapplicable to real complex problems. They assume that the transition model P is true. Theorem 1 needs $Q^{\pi}$ to be linear. The learned models are assumed to be consistent. These assumptions are not reasonable in many cases. 2. In many cases, the transition models need to be estimated, and then EMCTS will not work.
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Residual Connection · Average Pooling · Monte-Carlo Tree Search · Prioritized Experience Replay · Batch Normalization · Residual Block · Convolution · MuZero
