PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation

Albert Gong; Kamilė Stankevičiūtė; Chao Wan; Anmol Kabra; Raphael Thesmar; Johann Lee; Julius Klenke; Carla P. Gomes; Kilian Q. Weinberger

arXiv:2502.20377 · cs.LG · June 10, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

PhantomWiki is an on-demand dataset generation pipeline for evaluating reasoning and retrieval in large language models. It counters data leakage and inflated performance results by generating a unique, customizable corpus for each evaluation.

Contribution

PhantomWiki provides a scalable, data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities in LLMs. Unlike prior benchmarks, it is neither a fixed dataset nor based on existing data.

Findings

01

PhantomWiki datasets are surprisingly challenging for frontier LLMs.

02

Varying question difficulty and corpus size disentangles the evaluation of reasoning and retrieval capabilities, respectively.

03

Generating a new instance on demand for each evaluation reduces data leakage and inflated performance results.

Abstract

High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of…
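
The idea of generating a fresh corpus and question set for every evaluation can be illustrated with a toy sketch. This is not the PhantomWiki pipeline or its API: the function names (`generate_universe`, `make_articles`, `make_question`), the single "parent" relation, and the `PersonN` naming scheme are all hypothetical, chosen only to show how corpus size and question difficulty (here, the number of hops) could be varied independently.

```python
import random


def generate_universe(num_people, seed):
    """Build a random fact table mapping each person to a parent (hypothetical toy model)."""
    rng = random.Random(seed)
    names = [f"Person{i}" for i in range(num_people)]
    parent = {names[0]: None}  # the first person is the root and has no parent
    for i in range(1, num_people):
        # Each person picks a parent among earlier people, so the graph stays acyclic.
        parent[names[i]] = rng.choice(names[:i])
    return parent


def make_articles(parent):
    """Render one short text article per person from the fact table."""
    articles = {}
    for person, par in parent.items():
        if par is None:
            articles[person] = f"{person} has no recorded parent."
        else:
            articles[person] = f"The parent of {person} is {par}."
    return articles


def make_question(parent, hops, rng):
    """Return (question, answer) requiring `hops` parent lookups, or None if impossible."""
    candidates = []
    for person in parent:
        current = person
        for _ in range(hops):
            current = parent.get(current) if current is not None else None
        if current is not None:
            candidates.append((person, current))
    if not candidates:
        return None
    person, answer = rng.choice(candidates)
    phrase = person
    for _ in range(hops):
        phrase = f"the parent of {phrase}"
    return f"Who is {phrase}?", answer


# Corpus size (num_people) and question difficulty (hops) are separate knobs.
universe = generate_universe(num_people=20, seed=42)
corpus = make_articles(universe)
print(len(corpus), "articles")
print(make_question(universe, hops=2, rng=random.Random(7)))
```

Because the answer is derived from the same fact table that produces the articles, the questions in this sketch stay consistent with the corpus, and changing the seed yields a new instance that cannot have appeared in any training data.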

Code & Models

Repositories

kilian-group/phantom-wiki
Official

Datasets

kilian-group/phantom-wiki-v1
dataset · 638 downloads

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education