Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search

Jonathan Light; Min Cai; Weiqin Chen; Guanzhi Wang; Xiusi Chen; Wei Cheng; Yisong Yue; Ziniu Hu

arXiv:2408.10635·cs.AI·July 30, 2025

TL;DR

STRATEGIST is a novel framework combining LLMs and Monte Carlo Tree Search to improve decision-making in complex, multi-turn games without training data, outperforming traditional RL methods and matching human performance.

Contribution

It introduces a generalizable, training-free approach that integrates LLMs with tree search to optimize strategies through self-play in complex games.

Findings

01

Outperforms traditional RL agents in multi-turn, partial information games.

02

Achieves competitive performance against human players.

03

Effective in learning strategies without any training data.

Abstract

Traditional reinforcement learning and planning typically require vast amounts of data and training to develop effective policies. In contrast, large language models (LLMs) exhibit strong generalization and zero-shot capabilities, but struggle with tasks that require detailed planning and decision-making in complex action spaces. We introduce STRATEGIST, a novel approach that integrates the strengths of both methods. Our approach leverages LLMs to search and update high-level strategies (as text), which are then refined and executed by low-level Monte Carlo Tree Search (MCTS). STRATEGIST is a generalizable framework to optimize the strategy through population-based self-play simulations without the need for any training data. We demonstrate the effectiveness of STRATEGIST in learning optimal strategies for competitive, multi-turn games with partial information, including Game of Pure…
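
The population-based self-play loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The toy game, the numeric encoding of a strategy as a single float, and the names `play_match`, `round_robin`, `propose_revision`, and `improve` are all assumptions made for illustration. In particular, `propose_revision` is a random perturbation standing in for the LLM step that rewrites a text strategy, and `play_match` is a noisy comparison standing in for low-level MCTS play.

```python
import random
from itertools import combinations


def play_match(a, b, rng):
    """Toy stand-in for low-level play: return 1 if strategy a beats b, else 0.

    Strategies are floats in [0, 1]. The hidden optimum 0.7 is an assumption
    of this toy game, not something taken from the paper.
    """
    score_a = -abs(a - 0.7) + rng.gauss(0, 0.1)
    score_b = -abs(b - 0.7) + rng.gauss(0, 0.1)
    return 1 if score_a > score_b else 0


def round_robin(population, games_per_pair, rng):
    """Win rate of each strategy over games against every other strategy."""
    wins = [0] * len(population)
    played = [0] * len(population)
    for i, j in combinations(range(len(population)), 2):
        for _ in range(games_per_pair):
            result = play_match(population[i], population[j], rng)
            wins[i] += result
            wins[j] += 1 - result
            played[i] += 1
            played[j] += 1
    return [w / p if p else 0.0 for w, p in zip(wins, played)]


def propose_revision(strategy, rng):
    """Hypothetical stand-in for the LLM step that revises a strategy."""
    return min(1.0, max(0.0, strategy + rng.gauss(0, 0.15)))


def improve(population, rounds, keep, games_per_pair, seed=0):
    """Repeatedly propose revisions, evaluate by self-play, keep the best."""
    rng = random.Random(seed)
    for _ in range(rounds):
        candidates = population + [propose_revision(s, rng) for s in population]
        rates = round_robin(candidates, games_per_pair, rng)
        ranked = sorted(zip(rates, candidates), reverse=True)
        population = [s for _, s in ranked[:keep]]
    return population


if __name__ == "__main__":
    print(improve([0.1, 0.2, 0.3, 0.4], rounds=15, keep=4, games_per_pair=20))
```

No labelled data enters the loop: the only feedback signal is the outcome of simulated games between members of the population. Because each match is noisy, `games_per_pair` controls how reliable the ranking is, which is the trade-off Reviewer 03 raises below.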

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 3 · Confidence 3

Strengths

1. The idea of using an LLM to generate a high-level strategy and MCTS to derive the low-level strategy is interesting.
2. The proposed method is evaluated in two game environments, one simple and one complex, which is good.

Weaknesses

1. Lack of Competitive Baselines: The paper compares with methods that do not use LLMs at all (like DeepRole) and methods that use only LLMs but no RL. However, the comparisons exclude more recent RL-LLM hybrid methods that would provide a fairer benchmark for STRATEGIST's effectiveness. In fact, in the past two years, a number of papers have combined RL with LLMs to play complex strategic multi-agent games that involve natural-language-based communication. For example, the paper mention

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper is well written and well organised.
- The description of different components in STRATEGIST is clear.
- The experimental results are strong and comprehensive.

Weaknesses

- The differences and advantages of STRATEGIST's self-play mechanism for improving the strategy, relative to previous work on self-play for LLMs, are unclear. It would be helpful to provide a clear comparison (e.g. a table) between the self-play mechanism in STRATEGIST and previous self-play methods for LLMs (e.g. [1] [2] ...).

[1] Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv prepr

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Novel approach: the combination of LLMs with a bi-level tree search for strategy improvement is a novel method for enhancing performance in complex multi-agent games. Addressing both high-level strategic planning and low-level execution allows for improvements of agent capabilities.
2. Good empirical evaluation: The authors conduct various ablation studies and additional experiments comparing their method with established baselines like DeepRole and ReCon, providing a broader context for thei

Weaknesses

1. Reliability of population-based self-play simulation: the authors use round-robin games between top-ten strategies to evaluate the performance of high-level strategies. Since games like Avalon have high uncertainty and the variance of the simulation result is large, it would require many simulations (like hundreds) to get a reliable evaluation of these strategies. However, these simulations would take a long time for LLM inferences. In addition, LLM usually cannot take so many trajectories as

Code & Models

Repositories

jonathanmli/avalon-llm

Taxonomy

Topics: Educational Technology and Assessment · Data Mining Algorithms and Applications · Big Data and Business Intelligence