LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?
Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

TL;DR
LLM-WikiRace is a benchmark that tests how well large language models plan and reason over real-world knowledge graphs. Models perform strongly on the easy level but face significant challenges on the hard level.
Contribution
This paper introduces LLM-WikiRace, a benchmark that evaluates planning and reasoning in LLMs through Wikipedia navigation tasks. It highlights current model limitations and the importance of world knowledge.
Findings
Models perform well on easy tasks but struggle on hard ones: the best-performing model, Gemini-3, succeeds in only 23% of hard games.
Performance drops sharply on complex, long-horizon planning tasks.
Even top models often fail to replan after errors, entering loops.
Abstract
We introduce LLM-WikiRace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-WikiRace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point, beyond this…
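The navigation task described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy link graph `TOY_LINKS`, the `max_steps` limit, the rule that an invalid move ends the game, and both example policies. In the real benchmark an LLM would play the role of the policy and the graph would be Wikipedia's hyperlink structure.

```python
from collections import deque

# Hypothetical toy link graph (assumption); real games use Wikipedia hyperlinks.
TOY_LINKS = {
    "Coffee": ["Brazil", "Caffeine"],
    "Brazil": ["Coffee", "Football", "South America"],
    "Caffeine": ["Coffee", "Chemistry"],
    "Football": ["Brazil", "FIFA World Cup"],
    "South America": ["Brazil"],
    "Chemistry": ["Caffeine"],
    "FIFA World Cup": ["Football"],
}

def shortest_path_length(links, source, target):
    """Breadth-first search: minimum number of clicks, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def play_episode(links, source, target, policy, max_steps=10):
    """Run one game: the policy picks one outgoing link per step."""
    path = [source]
    while path[-1] != target and len(path) - 1 < max_steps:
        options = links.get(path[-1], [])
        if not options:
            break  # dead end
        choice = policy(path[-1], options, target, path)
        if choice not in options:
            break  # assumed rule: an invalid move ends the game
        path.append(choice)
    success = path[-1] == target
    revisits = len(path) - len(set(path))  # crude indicator of looping
    return success, path, revisits

def first_link_policy(page, options, target, path):
    """Naive policy that always clicks the first link; prone to loops."""
    return options[0]

def avoid_visited_policy(page, options, target, path):
    """Click the target if it is linked, else the first unvisited page."""
    if target in options:
        return target
    unvisited = [p for p in options if p not in path]
    return unvisited[0] if unvisited else options[0]

if __name__ == "__main__":
    print(shortest_path_length(TOY_LINKS, "Coffee", "FIFA World Cup"))
    print(play_episode(TOY_LINKS, "Coffee", "FIFA World Cup", first_link_policy))
    print(play_episode(TOY_LINKS, "Coffee", "FIFA World Cup", avoid_visited_policy))
```

On this toy graph the naive policy bounces between two pages until the step limit is reached, which mirrors the looping failure noted in the findings, while the second policy reaches the target in the minimum number of clicks. Comparing an episode's path length against the breadth-first optimum is one plausible way to score efficiency, though the paper's actual metrics may differ.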
Taxonomy
Topics: Multimodal Machine Learning Applications · AI-based Problem Solving and Planning · Advanced Graph Neural Networks
