TL;DR
HeroBench is a comprehensive benchmark designed to evaluate long-horizon planning and structured reasoning in virtual worlds, revealing significant performance gaps in current large language models.
Contribution
HeroBench introduces a complex RPG-inspired environment for end-to-end planning evaluation, integrating multiple reasoning skills and scalable difficulty levels.
Findings
An evaluation of 25 state-of-the-art LLMs shows large performance disparities.
No model reliably solves the hardest tasks.
Reasoning models outperform others but still struggle with complex planning.
Abstract
Large language models (LLMs) perform well on step-by-step reasoning benchmarks such as mathematics and code generation, yet their ability to carry out robust long-horizon planning under realistic constraints remains insufficiently evaluated. Existing planning benchmarks often rely on abstract domains or interactive feedback, obscuring end-to-end planning failures and feasibility errors. We introduce HeroBench, a benchmark for evaluating long-horizon, hierarchical planning and structured reasoning in a complex RPG-inspired virtual world. Tasks require models to select numerically feasible equipment, reason over multi-level crafting and resource dependencies, and execute hundreds to thousands of actions as a single end-to-end plan. HeroBench integrates symbolic planning, numeric combat simulation, spatial reasoning, and resource management, while supporting scalable difficulty and…
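To make "multi-level crafting and resource dependencies" concrete, here is a minimal sketch of how a craft target could be expanded into raw-resource totals and a valid crafting order. It is an illustration only, not HeroBench's actual task format or API: the recipe table, the item names, and the functions `raw_requirements` and `crafting_order` are all hypothetical.

```python
from collections import Counter

# Hypothetical recipes (not taken from HeroBench): item -> {ingredient: quantity}.
# Any item missing from this table is treated as a raw resource to be gathered.
RECIPES = {
    "iron_sword": {"iron_bar": 3, "wooden_hilt": 1},
    "iron_bar": {"iron_ore": 2},
    "wooden_hilt": {"ash_wood": 2},
}


def raw_requirements(item, quantity=1, recipes=RECIPES):
    """Recursively expand a craft target into total raw-resource counts."""
    if item not in recipes:
        # Base case: a raw resource contributes its own quantity.
        return Counter({item: quantity})
    total = Counter()
    for ingredient, per_unit in recipes[item].items():
        total += raw_requirements(ingredient, per_unit * quantity, recipes)
    return total


def crafting_order(item, recipes=RECIPES, order=None):
    """Post-order traversal: every ingredient is crafted before its consumer."""
    if order is None:
        order = []
    if item in recipes and item not in order:
        for ingredient in recipes[item]:
            crafting_order(ingredient, recipes, order)
        order.append(item)
    return order


if __name__ == "__main__":
    print(dict(raw_requirements("iron_sword")))
    print(crafting_order("iron_sword"))
```

Even this toy version shows why plan length grows quickly: each extra crafting level multiplies the gathering actions required, and a full task would also have to account for movement, combat feasibility, and inventory limits, none of which are modeled here.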
