RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora

Hanjun Cho; Jay-Yoon Lee

arXiv:2604.19047·cs.CL·April 22, 2026

TL;DR

RARE is a new evaluation framework that accounts for redundancy in corpora, improving the assessment of retrieval systems in real-world, high-similarity document collections.

Contribution

RARE decomposes documents into atomic facts and enhances data generation with CRRF, producing more realistic benchmarks for redundant corpora.

Findings

01

Applying RARE to Finance, Legal, and Patent data shows significant drops in retriever performance, highlighting robustness gaps.

02

RARE enables domain-specific RAG evaluations that better reflect real-world conditions.

Abstract

Existing QA benchmarks typically assume distinct documents with minimal overlap, yet real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, where information is highly redundant and documents exhibit strong inter-document similarity. This mismatch undermines evaluation validity: retrievers can be unfairly undervalued even when they retrieve documents that provide sufficient evidence, because redundancy across documents is not accounted for in evaluation. On the other hand, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar and redundant documents. We present RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii)…
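The undervaluation problem described in the abstract can be illustrated with a small sketch. This is not the paper's metric: the function names, the document IDs, and the fact labels are all hypothetical, and the code assumes each document has already been decomposed into a set of atomic fact IDs. It contrasts a document-level recall, which only credits the annotated gold document, with a fact-level recall, which credits any retrieved document that carries the required facts.

```python
from typing import Dict, List, Set


def doc_level_recall(retrieved: List[str], gold_docs: Set[str]) -> float:
    """Fraction of annotated gold documents present in the retrieved list."""
    if not gold_docs:
        return 0.0
    return len(gold_docs & set(retrieved)) / len(gold_docs)


def fact_level_recall(
    retrieved: List[str],
    required_facts: Set[str],
    doc_facts: Dict[str, Set[str]],
) -> float:
    """Fraction of required atomic facts covered by any retrieved document."""
    if not required_facts:
        return 0.0
    covered: Set[str] = set()
    for doc_id in retrieved:
        # Credit every required fact the document contains, regardless of
        # whether this document was the one annotated as gold.
        covered |= doc_facts.get(doc_id, set()) & required_facts
    return len(covered) / len(required_facts)


# Hypothetical corpus: an amended filing repeats the facts of the original.
doc_facts = {
    "10K_2023": {"f1", "f2"},
    "10K_2023_amended": {"f1", "f2"},
    "press_release": {"f3"},
}
gold_docs = {"10K_2023"}
required_facts = {"f1", "f2"}
retrieved = ["10K_2023_amended", "press_release"]

print(doc_level_recall(retrieved, gold_docs))                   # 0.0
print(fact_level_recall(retrieved, required_facts, doc_facts))  # 1.0
```

In this toy case the retriever returns a near-duplicate that contains all the needed evidence, yet the document-level score is zero. Tracking facts instead of document IDs is one simple way to make that redundancy visible to the evaluation.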
