Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

Andrew Klearman; Radu Revutchi; Rohin Garg; Rishav Chakravarti; Samuel Marc Denton; Yuan Xue

arXiv:2604.20763 · cs.IR · April 23, 2026


TL;DR

This paper introduces semantic stratification for retrieval evaluation. The approach provides formal semantic coverage guarantees and interpretable visibility into retrieval failure modes, addressing the hidden bias of heuristically constructed query sets.

Contribution

The paper formalizes retrieval evaluation as a statistical estimation problem and proposes semantic stratification to make evaluation more reliable and interpretable.

Findings

01. Exposes systematic coverage gaps in existing benchmarks.

02. Identifies structural signals influencing retrieval performance.

03. Shows stratified evaluation offers more stable, transparent assessments.

Abstract

Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce semantic stratification, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps,…
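To make the contrast between a plain average and a coverage-aware estimate concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the choice to weight each stratum by its document count, and the toy numbers are all assumptions made for illustration. It only shows the general idea that a query set concentrated in one cluster can inflate an unweighted mean, and that strata with no queries can be listed explicitly.

```python
from collections import defaultdict

# Hypothetical helpers for illustration; not taken from the paper.

def coverage(strata_weights, query_strata):
    """Fraction of total stratum weight that has at least one query."""
    covered = set(query_strata)
    total = sum(strata_weights.values())
    return sum(w for s, w in strata_weights.items() if s in covered) / total

def missing_strata(strata_weights, query_strata):
    """Strata with no queries, i.e. candidates for new query generation."""
    return sorted(set(strata_weights) - set(query_strata))

def stratified_mean(strata_weights, scored_queries):
    """Weight each covered stratum's mean score by its stratum weight."""
    by_stratum = defaultdict(list)
    for stratum, score in scored_queries:
        by_stratum[stratum].append(score)
    covered_weight = sum(strata_weights[s] for s in by_stratum)
    weighted = sum(
        strata_weights[s] * sum(scores) / len(scores)
        for s, scores in by_stratum.items()
    )
    return weighted / covered_weight

# Toy data (assumed): stratum weight = number of documents in the cluster.
weights = {"A": 50, "B": 30, "C": 20}
scored = [("A", 1.0), ("A", 1.0), ("A", 1.0), ("B", 0.0)]
strata_seen = [s for s, _ in scored]

plain = sum(score for _, score in scored) / len(scored)
print(plain)                                 # unweighted mean over queries
print(stratified_mean(weights, scored))      # mean reweighted by stratum size
print(coverage(weights, strata_seen))        # share of corpus weight covered
print(missing_strata(weights, strata_seen))  # strata still lacking queries
```

In this toy setup the query set over-samples stratum A, so the unweighted mean is higher than the reweighted one, and stratum C is reported as uncovered rather than silently ignored.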
