LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

Juliusz Ziomek; William Bankes; Lorenz Wolf; Shyam Sundhar Ramesh; Xiaohang Tang; Ilija Bogunovic

arXiv:2602.16902·cs.AI·February 24, 2026

TL;DR

LLM-Wikirace is a benchmark that tests how well large language models plan and reason over real-world knowledge graphs by navigating Wikipedia hyperlinks. Frontier models perform strongly on the easy level, but the best one succeeds in only 23% of hard games.

Contribution

This paper introduces LLM-Wikirace, a benchmark that evaluates planning and reasoning in LLMs through Wikipedia navigation tasks. Its results expose the limits of current frontier models on hard games and show that world knowledge is necessary for success.

Findings

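01. Frontier models, including Gemini-3, GPT-5, and Claude Opus 4.5, achieve the strongest results on the easy level and demonstrate superhuman performance.

02. Performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games.

03. Even top models often fail to replan after errors and enter loops.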
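
The looping behaviour in the last finding can be made concrete with a small check over a navigation trace. This is a minimal sketch, assuming that a "loop" means returning to a page already visited in the same game; the paper's own criterion is not stated on this page, and the function name `find_first_revisit` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def find_first_revisit(path: List[str]) -> Optional[Tuple[int, int]]:
    """Return (first_index, repeat_index) for the earliest revisited page.

    Assumption: a loop is any return to a page seen earlier in the same game.
    Returns None when every page in the path is distinct.
    """
    first_seen: Dict[str, int] = {}
    for index, page in enumerate(path):
        if page in first_seen:
            return first_seen[page], index
        first_seen[page] = index
    return None


# Hypothetical trace: the agent bounces between two pages instead of replanning.
trace = ["Coffee", "Ethiopia", "Coffee", "Ethiopia", "Coffee"]
print(find_first_revisit(trace))
```

A stricter variant could require the same page-to-page transition to repeat, which separates deliberate backtracking from a stuck agent.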

Abstract

We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point, beyond this…
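
The task described in the abstract can be illustrated with a small navigation loop. This is a sketch under stated assumptions, not the benchmark's implementation: the link graph `LINKS` is a made-up toy, and `play_game`, `shortest_path_length`, `first_link_policy`, and the `max_steps` budget are hypothetical names chosen for illustration. In the real benchmark the policy would be an LLM choosing among the hyperlinks on the current page.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

# Hypothetical toy link graph: page title -> titles it links to.
LINKS: Dict[str, List[str]] = {
    "Coffee": ["Ethiopia", "Caffeine"],
    "Ethiopia": ["Africa", "Coffee"],
    "Caffeine": ["Chemistry", "Coffee"],
    "Africa": ["Nile"],
    "Chemistry": ["Science"],
    "Nile": [],
    "Science": [],
}

# A policy sees the current page, the target, and the available links.
Policy = Callable[[str, str, List[str]], str]


def shortest_path_length(graph: Dict[str, List[str]], source: str,
                         target: str) -> Optional[int]:
    """Breadth-first search: minimum number of clicks, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def play_game(graph: Dict[str, List[str]], source: str, target: str,
              policy: Policy, max_steps: int = 10) -> List[str]:
    """Follow one link per step until the target, a dead end, or the budget."""
    path = [source]
    while path[-1] != target and len(path) - 1 < max_steps:
        options = graph.get(path[-1], [])
        if not options:
            break  # dead end: no outgoing links
        choice = policy(path[-1], target, options)
        if choice not in options:
            break  # in this sketch, an invalid move ends the game
        path.append(choice)
    return path


def first_link_policy(current: str, target: str, options: List[str]) -> str:
    """Stand-in for an LLM: take the target if visible, else the first link."""
    return target if target in options else options[0]


path = play_game(LINKS, "Coffee", "Science", first_link_policy)
print(path, "optimal clicks:", shortest_path_length(LINKS, "Coffee", "Science"))
```

In this toy graph the greedy stand-in policy walks into a dead end even though a three-click route exists, which is the kind of failure that look-ahead planning is meant to avoid.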

Code & Models

Datasets

juliusz-ziomek/LLM-WikiRace-Benchmark
dataset · 12 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · AI-based Problem Solving and Planning · Advanced Graph Neural Networks