PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation
Albert Gong, Kamilė Stankevičiūtė, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P. Gomes, Kilian Q. Weinberger

TL;DR
PhantomWiki is an on-demand dataset generation pipeline for evaluating reasoning and retrieval in large language models. It addresses data leakage and inflated performance results by generating a unique, customizable corpus for each evaluation.
Contribution
The paper contributes a scalable, data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities in LLMs. Unlike prior benchmarks, it is neither a fixed dataset nor based on existing data.
Findings
PhantomWiki datasets are surprisingly challenging for frontier LLMs.
The framework enables disentangled evaluation of reasoning and retrieval capabilities.
On-demand generation mitigates data leakage and the inflated performance results it causes.
Abstract
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of…
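The on-demand idea in the abstract can be illustrated with a toy sketch. This is not the PhantomWiki implementation: the function names (`generate_universe`, `write_corpus`, `make_questions`), the single "parent" relation, and the article template are all hypothetical, chosen only to show how a fresh, internally consistent corpus and its question-answer pairs might be produced from a seed, with corpus size and number of reasoning hops as independent knobs.

```python
import random


def generate_universe(num_people, seed):
    """Create fictional people and a random parent relation (hypothetical toy model)."""
    rng = random.Random(seed)
    # Names embed the seed, so every instance uses entities that exist nowhere else.
    names = [f"Person{seed}_{i}" for i in range(num_people)]
    parent = {}
    for i in range(1, num_people):
        # A parent is always an earlier person, so the relation has no cycles.
        parent[names[i]] = names[rng.randrange(i)]
    return names, parent


def write_corpus(names, parent):
    """Render one short article per person from the underlying facts."""
    docs = {}
    for name in names:
        if name in parent:
            docs[name] = f"{name} is a fictional person. The parent of {name} is {parent[name]}."
        else:
            docs[name] = f"{name} is a fictional person. The parent of {name} is not recorded."
    return docs


def ancestor(parent, person, hops):
    """Follow the parent relation `hops` times; return None if the chain ends early."""
    for _ in range(hops):
        person = parent.get(person)
        if person is None:
            return None
    return person


def make_questions(names, parent, hops, count, rng):
    """Build up to `count` questions that each need `hops` lookups to answer."""
    candidates = [n for n in names if ancestor(parent, n, hops) is not None]
    rng.shuffle(candidates)
    questions = []
    for name in candidates[:count]:
        phrase = "the parent of " * hops + name
        questions.append((f"Who is {phrase}?", ancestor(parent, name, hops)))
    return questions


if __name__ == "__main__":
    names, parent = generate_universe(num_people=50, seed=1)
    docs = write_corpus(names, parent)
    for question, answer in make_questions(names, parent, hops=2, count=3, rng=random.Random(0)):
        print(question, "->", answer)
```

In this sketch, `num_people` controls how much text a model must search (retrieval load) and `hops` controls how many facts must be chained (reasoning load). Because answers are computed from the same facts that produce the articles, the corpus and its answer key stay consistent by construction, and changing the seed yields an entirely new instance.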
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
