HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds

Petr Anokhin; Roman Khalikov; Stefan Rebrikov; Viktor Volkov; Artyom Sorokin; Vincent Bissonnette

arXiv:2508.12782·cs.AI·April 21, 2026

TL;DR

HeroBench is a benchmark for evaluating long-horizon planning and structured reasoning in virtual worlds. Its evaluation reveals significant performance gaps in current large language models.

Contribution

The paper introduces a complex RPG-inspired environment for end-to-end planning evaluation that integrates multiple reasoning skills and supports scalable difficulty levels.

Findings

01. 25 state-of-the-art LLMs show large performance disparities.
02. No model reliably solves the hardest tasks.
03. Reasoning models outperform others but still struggle with complex planning.

Abstract

Large language models (LLMs) perform well on step-by-step reasoning benchmarks such as mathematics and code generation, yet their ability to carry out robust long-horizon planning under realistic constraints remains insufficiently evaluated. Existing planning benchmarks often rely on abstract domains or interactive feedback, obscuring end-to-end planning failures and feasibility errors. We introduce HeroBench, a benchmark for evaluating long-horizon, hierarchical planning and structured reasoning in a complex RPG-inspired virtual world. Tasks require models to select numerically feasible equipment, reason over multi-level crafting and resource dependencies, and execute hundreds to thousands of actions as a single end-to-end plan. HeroBench integrates symbolic planning, numeric combat simulation, spatial reasoning, and resource management, while supporting scalable difficulty and…
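
To make "multi-level crafting and resource dependencies" concrete, here is a minimal sketch of expanding a crafted item into the raw resources it ultimately requires. The recipe table, item names, and the `raw_requirements` helper are all hypothetical illustrations and are not taken from HeroBench or its repository; the benchmark's actual data format may differ.

```python
from collections import Counter

# Hypothetical recipes (item -> {ingredient: quantity}); NOT HeroBench data.
RECIPES = {
    "iron_sword": {"iron_bar": 3, "wooden_hilt": 1},
    "iron_bar": {"iron_ore": 2},
    "wooden_hilt": {"wood": 2},
}


def raw_requirements(item, quantity=1, recipes=RECIPES):
    """Recursively expand an item into the raw resources needed to craft it.

    Anything without a recipe is treated as a raw, gatherable resource.
    Assumes the recipe graph has no cycles.
    """
    if item not in recipes:
        return Counter({item: quantity})
    total = Counter()
    for ingredient, per_unit in recipes[item].items():
        # Each unit of the parent needs `per_unit` of the ingredient.
        total += raw_requirements(ingredient, per_unit * quantity, recipes)
    return total


needs = raw_requirements("iron_sword")
print(dict(needs))
```

Even this toy case shows why plan length grows quickly: each extra crafting level multiplies the number of gathering actions a single end-to-end plan must contain.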

Code & Models

Repositories

stefanrer/HeroBench (GitHub)
