TL;DR
LudoBench is a benchmark for assessing large language models' strategic reasoning in the complex, stochastic game of Ludo, using handcrafted scenarios and a functional simulator to analyze model behaviors and vulnerabilities.
Contribution
Introduces LudoBench, a benchmark of 480 handcrafted spot scenarios across 12 decision categories, together with a 4-player Ludo simulator, for evaluating LLM strategic reasoning. The evaluation surfaces distinct behavioral archetypes and sensitivity to prompt framing.
Findings
All six evaluated models agree with the game-theory (Expectiminimax) baseline only 40-46% of the time.
Models fall into archetypes: finishers and builders, each capturing only half of the strategy.
Behavioral shifts occur under history-conditioned grudge framing, indicating prompt sensitivity.
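The agreement figure in the findings can be read as a simple match rate between a model's chosen move and the baseline's chosen move across scenarios. The sketch below shows one way such a rate could be computed; the function name and the move labels are hypothetical and are not taken from the LudoBench code.

```python
from typing import Sequence


def agreement_rate(model_choices: Sequence[str], baseline_choices: Sequence[str]) -> float:
    """Fraction of scenarios where the model picks the same move as the baseline.

    Both sequences are assumed to be aligned by scenario (hypothetical format).
    """
    if len(model_choices) != len(baseline_choices):
        raise ValueError("choice lists must cover the same scenarios")
    if not model_choices:
        raise ValueError("at least one scenario is required")
    matches = sum(m == b for m, b in zip(model_choices, baseline_choices))
    return matches / len(model_choices)


# Hypothetical move labels for five scenarios.
model = ["capture", "advance", "exit_base", "advance", "safe_square"]
baseline = ["capture", "safe_square", "exit_base", "capture", "advance"]
rate = agreement_rate(model, baseline)
```

Here the model matches the baseline in two of five scenarios, so `rate` is 0.4, the low end of the range reported above.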
Abstract
We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but…
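The depth-limited Expectiminimax search mentioned in the abstract alternates chance nodes (dice rolls, averaged) with decision nodes (piece choices, maximized or minimized). The following is a minimal sketch on a toy two-player race game with capture, not the paper's 4-player agent: the track length `GOAL`, the three-sided die `DIE`, the shared-track capture rule, and the progress-difference `evaluate` heuristic are all assumptions made for illustration.

```python
from typing import Optional, Tuple

State = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (player 0 pieces, player 1 pieces)
GOAL = 6         # hypothetical track length; a piece at GOAL is home
DIE = (1, 2, 3)  # hypothetical small uniform die


def move(state: State, player: int, piece: int, roll: int) -> Optional[State]:
    """Return the state after `player` moves `piece` by `roll`, or None if illegal."""
    mine, theirs = list(state[player]), list(state[1 - player])
    if mine[piece] >= GOAL:
        return None  # piece is already home
    target = min(mine[piece] + roll, GOAL)
    mine[piece] = target
    if target < GOAL:
        # Capture: opposing pieces on the target square return to the start.
        theirs = [0 if p == target else p for p in theirs]
    sides = (tuple(mine), tuple(theirs))
    return sides if player == 0 else (sides[1], sides[0])


def evaluate(state: State) -> float:
    """Heuristic from player 0's view: total progress difference."""
    return float(sum(state[0]) - sum(state[1]))


def is_terminal(state: State) -> bool:
    return any(all(p >= GOAL for p in side) for side in state)


def chance_value(state: State, player: int, depth: int) -> float:
    """Chance node: average over die rolls before `player` picks a piece."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    return sum(decision_value(state, player, r, depth) for r in DIE) / len(DIE)


def decision_value(state: State, player: int, roll: int, depth: int) -> float:
    """Decision node: player 0 maximizes, player 1 minimizes."""
    values = []
    for i in range(len(state[player])):
        child = move(state, player, i, roll)
        if child is not None:
            values.append(chance_value(child, 1 - player, depth - 1))
    if not values:  # no legal move: pass the turn
        return chance_value(state, 1 - player, depth - 1)
    return max(values) if player == 0 else min(values)


def best_move(state: State, roll: int, depth: int = 2) -> Optional[int]:
    """Piece index for player 0 with the highest depth-limited expected value."""
    options = {}
    for i in range(len(state[0])):
        child = move(state, 0, i, roll)
        if child is not None:
            options[i] = chance_value(child, 1, depth - 1)
    return max(options, key=options.get) if options else None
```

With `state = ((2, 0), (3, 0))`, a roll of 1, and `depth=1`, moving piece 0 captures the opposing piece on square 3 and scores higher than advancing piece 1, so `best_move` returns 0. A 4-player setting would need a multi-player generalization of the min/max step, which this sketch does not attempt.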
