Doubly Robust Monte Carlo Tree Search
Manqing Liu, Andrew L. Beam

TL;DR
Doubly Robust Monte Carlo Tree Search (DR-MCTS) integrates DR off-policy estimation into MCTS to improve sample efficiency and decision accuracy in complex environments, with proven theoretical guarantees and superior empirical performance.
Contribution
This paper introduces DR-MCTS, a hybrid estimator combining MCTS rollouts with Doubly Robust estimation, providing unbiasedness and variance reduction for enhanced decision-making.
Findings
In Tic-Tac-Toe, DR-MCTS achieves an 88% win rate, compared with 10% for standard MCTS.
In compound VirtualHome tasks, DR-MCTS achieves a 20.7% success rate versus 10.3% for standard MCTS.
DR-MCTS shows better sample efficiency, especially with larger language models.
Abstract
We present Doubly Robust Monte Carlo Tree Search (DR-MCTS), a novel algorithm that integrates Doubly Robust (DR) off-policy estimation into Monte Carlo Tree Search (MCTS) to enhance sample efficiency and decision quality in complex environments. Our approach introduces a hybrid estimator that combines MCTS rollouts with DR estimation, offering theoretical guarantees of unbiasedness and variance reduction under specified conditions. Empirical evaluations in Tic-Tac-Toe and the partially observable VirtualHome environment demonstrate DR-MCTS's superior performance over standard MCTS. In Tic-Tac-Toe, DR-MCTS achieves an 88% win rate compared to a 10% win rate for standard MCTS. In compound VirtualHome tasks, DR-MCTS attains a 20.7% success rate versus 10.3% for standard MCTS. Our scaling analysis reveals that DR-MCTS exhibits better sample efficiency, notably outperforming standard MCTS…
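As a rough illustration of what doubly robust off-policy estimation computes, the sketch below implements the standard recursive DR estimator from the off-policy evaluation literature for a single trajectory. It is not taken from the paper: the function name `doubly_robust_value`, the trajectory layout, and the `v_hat`/`q_hat` callables are hypothetical, and how DR-MCTS actually folds this into the tree backup may differ.

```python
from typing import Callable, Hashable, Sequence, Tuple

# One step of a logged rollout (hypothetical layout, not from the paper):
# (state, action, reward, target-policy prob, behavior-policy prob)
Step = Tuple[Hashable, Hashable, float, float, float]


def doubly_robust_value(
    trajectory: Sequence[Step],
    v_hat: Callable[[Hashable], float],
    q_hat: Callable[[Hashable, Hashable], float],
    gamma: float = 1.0,
) -> float:
    """Standard recursive DR estimate of the value at the first state.

    Working backwards from the end of the trajectory:
        V_DR(t) = v_hat(s_t) + rho_t * (r_t + gamma * V_DR(t+1) - q_hat(s_t, a_t))
    with V_DR beyond the last step defined as 0.
    """
    estimate = 0.0
    for state, action, reward, pi_target, pi_behavior in reversed(trajectory):
        rho = pi_target / pi_behavior  # per-step importance weight
        correction = reward + gamma * estimate - q_hat(state, action)
        estimate = v_hat(state) + rho * correction
    return estimate
```

With `v_hat` and `q_hat` both returning zero and all importance weights equal to one, the recursion reduces to the ordinary discounted return of the rollout. When the value model is exact on a deterministic trajectory, the correction term vanishes and the estimate equals `v_hat` at the root regardless of the importance weights.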
Peer Reviews
Decision·Submitted to ICLR 2026
- **Novel Integration of Doubly Robust Estimation in MCTS:** The paper presents a novel approach to improving the simulation efficiency of MCTS, a long-standing challenge. Instead of simply truncating rollouts or replacing them with a learned value function, the authors are the first to propose integrating doubly robust (DR) off-policy evaluation directly into the MCTS value backup. The quality of this contribution is further deepened by the proposal of an adaptive hybrid estimator, introducin
**1. Lack of Empirical Comparison for the Core Contribution ($V_{hybrid}$):**
- The paper's main contribution is the "adaptive hybrid estimator" ($V_{hybrid}$), which claims to minimize variance by dynamically combining $V_{MCTS}$ and $V_{DR}$. However, the empirical evidence to support this claim is critically lacking.
- There are no ablation studies that directly compare the performance of using only $V_{MCTS}$, only $V_{DR}$, and the proposed $V_{hybrid}$.
- Furthermore, to prove that $V
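To make concrete what a variance-minimizing combination of two estimators means in general, the sketch below computes the textbook optimal weight for a convex combination $\beta V_{MCTS} + (1-\beta) V_{DR}$ from paired samples. This is an assumption-laden illustration, not the paper's adaptive rule: the function names, the sample-based variance estimates, the clipping to $[0, 1]$, and the fallback weight of 0.5 are all hypothetical choices.

```python
from statistics import mean
from typing import Sequence


def min_variance_weight(mcts_samples: Sequence[float], dr_samples: Sequence[float]) -> float:
    """Weight beta on the MCTS estimate minimizing Var(beta*M + (1-beta)*D).

    Closed form: beta* = (Var(D) - Cov(M, D)) / (Var(M) + Var(D) - 2*Cov(M, D)),
    estimated from paired samples and clipped to [0, 1].
    """
    n = len(mcts_samples)
    if n != len(dr_samples) or n < 2:
        raise ValueError("need at least two paired samples")
    m_bar, d_bar = mean(mcts_samples), mean(dr_samples)
    var_m = sum((m - m_bar) ** 2 for m in mcts_samples) / (n - 1)
    var_d = sum((d - d_bar) ** 2 for d in dr_samples) / (n - 1)
    cov = sum((m - m_bar) * (d - d_bar) for m, d in zip(mcts_samples, dr_samples)) / (n - 1)
    denom = var_m + var_d - 2.0 * cov
    if denom <= 0.0:
        return 0.5  # degenerate case (hypothetical fallback): weight both equally
    beta = (var_d - cov) / denom
    return min(1.0, max(0.0, beta))


def hybrid_value(v_mcts: float, v_dr: float, beta: float) -> float:
    """Convex combination of the two value estimates."""
    return beta * v_mcts + (1.0 - beta) * v_dr
```

An ablation of the kind the reviewer asks for would correspond to fixing `beta` at 1 (rollouts only), at 0 (DR only), and at the adaptive value, then comparing the resulting performance.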
The paper offers valuable insights and contributions towards constructing improved MCTS algorithms, specifically in achieving variance reduction and enhancing efficiency. It builds effectively upon previous technologies in the MCTS domain. A notable strength is the authors' attempt to provide important theoretical guarantees regarding the applicability of their DR-MCTS algorithms, clarifying the specific circumstances under which variance reduction can be realized.
The depth of the theoretical contribution is somewhat open to question. Lemma 2.1 appears to be relatively straightforward to prove, as it primarily involves taking the expectation over the additive components of the value function. Furthermore, the condition for equality in Lemma 2.2 seems quite strict; it would be helpful to understand if this holds by definition for all DR-MCTS instances or requires stringent enforcement. Regarding the algorithmic design, the contribution of the hybrid DR-MC
- The authors propose a simple modification to the rollout step for MCTS by using an IS-weighted advantage instead of the returns, which reduces variance.
- Quite an extensive related work section, which I found interesting for comparing recent approaches for value estimation/rollouts in MCTS.
- Proper choice of confidence intervals (Wilson intervals) for the win-rates in the experiment section, which I too rarely see other researchers do.
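For readers unfamiliar with the Wilson interval mentioned above, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion such as a win rate. The function name and the default `z = 1.96` (roughly a 95% interval) are illustrative choices; the number of games per condition used in the paper is not stated here, so any trial count passed in is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Unlike the naive normal-approximation interval, the bounds stay inside
    [0, 1] and remain sensible when the observed rate is near 0 or 1.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials)) / denom
    return center - half_width, center + half_width
```

For example, `wilson_interval(88, 100)` gives the interval for an 88% win rate if it had been observed over a hypothetical 100 games; fewer games widen the interval, which is why the reported win rates are hard to interpret without it.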
The LLM statement on page 10 includes "No LLMs were used for data generation, ...". The authors then contradict themselves, e.g., in the paragraph in lines 363-372, where GPT-4o was used to generate the policy prior, or in lines 347-356, where GPT-4o-mini was used as a world model.
*Major comments:* The introduction claims that the new method achieves superior performance and decision quality, however, none of the results show statistical significance. Aside from the result that the new method is cons
Taxonomy
Topics: Artificial Intelligence in Games
