Epistemic Monte Carlo Tree Search

Yaniv Oren; Viliam Vadocz; Matthijs T. J. Spaan; Wendelin Böhmer

arXiv:2210.13455 · cs.LG · May 18, 2026

TL;DR

This paper introduces Epistemic MCTS, a method that incorporates epistemic uncertainty into Monte Carlo Tree Search, improving exploration and sample efficiency in challenging sparse-reward tasks.

Contribution

The paper presents a theoretically motivated approach to account for epistemic uncertainty in MCTS, enhancing deep exploration in complex environments.

Findings

1. EMCTS achieves higher sample efficiency in SUBLEQ code-writing tasks.
2. EMCTS solves Deep Sea benchmarks faster than baseline methods.
3. Incorporating epistemic uncertainty improves exploration in sparse-reward settings.

Abstract

The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse-reward environments. However, MCTS does not account for the propagation of this uncertainty. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the assembly language SUBLEQ, AZ paired with our method achieves significantly higher sample efficiency than baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea, which baseline A/MZ are…
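The abstract names two ingredients: propagating the learned models' epistemic uncertainty through the search, and using it to drive deep exploration. The sketch below shows one way these could fit into a MuZero-style tree, assuming each learned reward and value comes with a mean and a variance, that reward and value errors are independent (so variances add under the Bellman backup), and that exploration uses an optimistic bonus `beta * sqrt(variance)` added to the usual PUCT score. `Node`, `select_child`, `backup`, `beta` and `unvisited_var` are hypothetical names; this is an illustrative sketch, not the authors' implementation, and the paper's actual uncertainty estimates and bound may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    # Hypothetical node layout for illustration only.
    prior: float                 # policy prior p(a | s) from the learned network
    reward_mean: float = 0.0     # learned reward for the transition into this node
    reward_var: float = 0.0      # assumed epistemic variance of that reward estimate
    visit_count: int = 0
    value_sum: float = 0.0       # sum of backed-up mean values of this node's state
    var_sum: float = 0.0         # sum of backed-up epistemic variances
    children: Dict[int, "Node"] = field(default_factory=dict)

    def value_stats(self, unvisited_var: float):
        """Mean and variance of this node's value (variance is a simple per-visit average)."""
        if self.visit_count == 0:
            return 0.0, unvisited_var  # unvisited nodes get a default, high variance
        return self.value_sum / self.visit_count, self.var_sum / self.visit_count


def select_child(node: Node, gamma: float, beta: float, c_puct: float,
                 unvisited_var: float = 1.0) -> Optional[int]:
    """PUCT selection plus an optimistic bonus beta * std of the epistemic Q estimate."""
    best_action, best_score = None, -math.inf
    sqrt_n = math.sqrt(node.visit_count)
    for action, child in node.children.items():
        v_mean, v_var = child.value_stats(unvisited_var)
        q_mean = child.reward_mean + gamma * v_mean
        # Independence assumption: Var[r + gamma * V] = Var[r] + gamma^2 * Var[V]
        q_var = child.reward_var + gamma ** 2 * v_var
        explore = c_puct * child.prior * sqrt_n / (1 + child.visit_count)
        score = q_mean + beta * math.sqrt(q_var) + explore
        if score > best_score:
            best_action, best_score = action, score
    return best_action


def backup(path: List[Node], leaf_mean: float, leaf_var: float, gamma: float) -> None:
    """Propagate mean and epistemic variance from the leaf back to the root.

    `path` runs root -> leaf; leaf_mean / leaf_var stand in for a learned
    value network that also outputs an uncertainty estimate (an assumption).
    """
    g_mean, g_var = leaf_mean, leaf_var
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += g_mean
        node.var_sum += g_var
        # Parent-state value = reward into this node + discounted value of this node
        g_mean = node.reward_mean + gamma * g_mean
        g_var = node.reward_var + gamma ** 2 * g_var


# Usage: two actions with equal means; action 1 has a more uncertain reward.
root = Node(prior=1.0, children={0: Node(prior=0.5), 1: Node(prior=0.5, reward_var=0.5)})
action = select_child(root, gamma=0.9, beta=1.0, c_puct=1.25)  # optimism picks action 1
backup([root, root.children[action]], leaf_mean=0.2, leaf_var=0.1, gamma=0.9)
```

With `beta = 0` the bonus vanishes and selection reduces to plain PUCT on the mean estimates; a larger `beta` steers the search toward actions whose learned reward or value is more uncertain, which is the kind of optimism that can help in sparse-reward tasks such as Deep Sea.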

Peer Reviews

Decision: ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The new algorithm performs MCTS with a learned model, supporting deep exploration and variance propagation.
2. The method is supported by theoretical insights.
3. The method performs quite well on the benchmark tasks.

Weaknesses

1. The practical value of epistemic MCTS remains questionable. In what real-world tasks do we need to run search without knowing the world model?

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. The paper provides a relatively rigorous theoretical foundation for EMCTS. The authors derive an upper confidence bound for the epistemic uncertainty within the learned MCTS model, which theoretically justifies the exploration strategy of EMCTS.
2. The experiments demonstrate the effectiveness of the EMCTS algorithm on the SUBLEQ and Deep Sea tasks. EMCTS shows better sample efficiency than other baseline methods. Notably, EMCTS demonstrates deep exploration in the presence of the learned value, reward,…

Weaknesses

1. The authors assume consistency across the reward and value. This assumption might not hold in dynamic environments or with certain deep neural network structures in practice. To what degree do these assumptions hold in practice? Are there any experimental results showing that they hold (or nearly hold)? How does EMCTS handle potential inconsistencies in epistemic uncertainty when the learned models (reward, value, transition) are not fully aligned?
2. The upper bound…

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This manuscript proposes a novel and solid framework to introduce and handle the uncertainty of a learned model caused by limited data or sparse-reward environments, which is not taken into account in standard MCTS.
2. EMCTS addresses a two-fold objective, i.e., extending MCTS to estimate and propagate the epistemic uncertainty from the uncertain learned model, and harnessing that uncertainty in the search to achieve deep exploration of the environment, via formulating the epistemic uncertainty…

Weaknesses

1. Some assumptions might make EMCTS inapplicable to real, complex problems: the transition model $P$ is assumed to be true, Theorem 1 needs $Q^{\pi}$ to be linear, and the learned models are assumed to be consistent. These assumptions are not reasonable in many cases.
2. In many cases, the transition model needs to be estimated, and then EMCTS will not work.
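Reviewer 03's point about linearity can be illustrated with a standard identity (a general fact about linear models, not a restatement of the paper's Theorem 1). Suppose $Q^{\pi}(s,a) = \phi(s,a)^\top w$ for a fixed feature map $\phi$, and the epistemic uncertainty is over the weights $w$, with mean $\bar{w}$ and covariance $\Sigma$. Linearity of expectation then gives the first two moments of $Q^{\pi}$ exactly:

$$
\mathbb{E}\left[Q^{\pi}(s,a)\right] = \phi(s,a)^\top \bar{w}, \qquad
\mathrm{Var}\left[Q^{\pi}(s,a)\right] = \phi(s,a)^\top \Sigma \, \phi(s,a).
$$

For a nonlinear $Q^{\pi}$, no such closed form holds in general, and the variance can only be approximated (e.g. with a first-order Taylor expansion). This is one reason a linearity assumption makes exact uncertainty propagation tractable.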

Videos

Epistemic Monte Carlo Tree Search · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Residual Connection · Average Pooling · Monte-Carlo Tree Search · Prioritized Experience Replay · Batch Normalization · Residual Block · Convolution · MuZero