Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Nitay Calderon; Eyal Ben-David; Zorik Gekhman; Eran Ofek; Gal Yona

arXiv:2602.14080·cs.CL·February 17, 2026

Open Access

TL;DR

This paper introduces a new framework and benchmark for analyzing factual errors in large language models, revealing that access to encoded knowledge, rather than encoding itself, is the main bottleneck, and that reasoning can improve recall.

Contribution

The paper proposes a fact-level profiling framework and the WikiProfile benchmark to distinguish failures of knowledge encoding from failures of access in LLMs, highlighting the role of recall and inference-time thinking.

Findings

01. Encoding of facts is nearly saturated in frontier models.

02. Recall failures are systematic and affect long-tail and reverse questions.

03. Thinking methods can recover a significant portion of recall failures.

Abstract

Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed…
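
The fact-level profile described in the abstract can be read as a small decision procedure: first ask whether a fact is encoded, then ask how accessible it is. The sketch below is only an illustration of that taxonomy, not the authors' implementation. The names `Profile`, `FactObservations`, and `profile_fact` are hypothetical, and the three boolean inputs are assumed to come from some upstream evaluation that this excerpt does not specify.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Profile(Enum):
    """Fact-level categories named in the abstract (labels are illustrative)."""
    NOT_ENCODED = "not encoded (empty shelf)"
    CANNOT_RECALL = "encoded, cannot be recalled (lost key)"
    DIRECT_RECALL = "encoded, directly recalled"
    RECALL_WITH_THINKING = "encoded, recalled only with thinking"


@dataclass(frozen=True)
class FactObservations:
    """Hypothetical per-fact signals; how each is measured is an assumption."""
    encoded: bool                 # the model shows evidence of knowing the fact
    recalled_directly: bool       # answered correctly without thinking
    recalled_with_thinking: bool  # answered correctly with inference-time thinking


def profile_fact(obs: FactObservations) -> Profile:
    # Encoding is checked first, mirroring "whether it is encoded, and then
    # by how accessible it is" in the abstract.
    if not obs.encoded:
        return Profile.NOT_ENCODED
    if obs.recalled_directly:
        return Profile.DIRECT_RECALL
    if obs.recalled_with_thinking:
        return Profile.RECALL_WITH_THINKING
    return Profile.CANNOT_RECALL


def summarize(observations: list[FactObservations]) -> Counter:
    """Count how many facts fall into each profile."""
    return Counter(profile_fact(o) for o in observations)
```

Under these assumptions, aggregating with `summarize` over many facts would separate errors caused by missing knowledge from errors caused by inaccessible knowledge, which is the distinction a question-level accuracy score hides.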

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education