Practical RAG Evaluation: A Rarity-Aware Set-Based Metric and Cost-Latency-Quality Trade-offs
Etienne Dallaire

TL;DR
This paper introduces a rarity-aware evaluation metric and a comprehensive benchmarking framework for production RAG systems, addressing limitations of classical IR metrics and providing reproducible, cost-aware decision tools.
Contribution
It proposes a new rarity-aware set score, a golden-set pipeline, and a detailed benchmark for production RAG, enabling better evaluation and optimization.
Findings
RA-nWG@K effectively measures rarity-aware retrieval quality.
The golden-set pipeline outperforms single-shot ranking methods.
Benchmark results reveal trade-offs among retrieval models, dimensions, and rerankers.
Abstract
This paper addresses the guessing game in building production RAG. First, classical rank-centric IR metrics (nDCG/MAP/MRR) are a poor fit for RAG, where LLMs consume a set of passages rather than a browsed list; position discounts and prevalence-blind aggregation miss what matters: whether the prompt at cutoff K contains the decisive evidence. Second, there is no standardized, reproducible way to build and audit golden sets. Third, leaderboards exist but lack end-to-end, on-corpus benchmarking that reflects production trade-offs. Fourth, how state-of-the-art embedding models handle proper-name identity signals and conversational noise remains opaque. To address these, we contribute: (1) RA-nWG@K, a rarity-aware, per-query-normalized set score, and operational ceilings via the pool-restricted oracle ceiling (PROC) and the percentage of PROC (%PROC) to separate retrieval from ordering headroom…
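The abstract names RA-nWG@K, PROC, and %PROC without giving their formulas, so the following is only a minimal sketch of how such set-based scores could be computed, not the paper's definition. It assumes integer relevance grades per document, a hypothetical rarity weighting (grade divided by the number of judged documents sharing that grade for the query), and normalization by the best K-document set over all judged documents. All function names are illustrative.

```python
from collections import Counter

def doc_weights(grades):
    # ASSUMPTION: rarity weight = grade / number of judged docs with that grade.
    # The paper's actual weighting is not given in the abstract.
    counts = Counter(g for g in grades.values() if g > 0)
    return {d: g / counts[g] for d, g in grades.items() if g > 0}

def best_set_gain(weights, k):
    # Gain of the best possible K-document set: sum of the K largest weights.
    return sum(sorted(weights.values(), reverse=True)[:k])

def ra_nwg_at_k(ranked, grades, k):
    # Set-based score: no position discount inside the top K,
    # normalized per query by the ideal K-document set.
    w = doc_weights(grades)
    ideal = best_set_gain(w, k)
    gain = sum(w.get(d, 0.0) for d in set(ranked[:k]))
    return gain / ideal if ideal else 0.0

def proc_at_k(pool, grades, k):
    # Pool-restricted oracle ceiling: the best score any reordering
    # of the retrieved candidate pool could reach at cutoff K.
    w = doc_weights(grades)
    ideal = best_set_gain(w, k)
    pool_w = {d: w[d] for d in set(pool) if d in w}
    return best_set_gain(pool_w, k) / ideal if ideal else 0.0

def pct_proc(ranked, pool, grades, k):
    # Fraction of the pool's ceiling that the final ordering actually captured.
    ceiling = proc_at_k(pool, grades, k)
    return ra_nwg_at_k(ranked, grades, k) / ceiling if ceiling else 0.0

# Toy query: one rare, decisive document ("a") and three common, weak ones.
grades = {"a": 3, "b": 1, "c": 1, "d": 1, "e": 0}
pool = ["b", "c", "a", "e"]   # candidates returned by the retriever
ranked = pool                 # final order placed in the prompt
score = ra_nwg_at_k(ranked, grades, k=2)
ceiling = proc_at_k(pool, grades, k=2)
share = pct_proc(ranked, pool, grades, k=2)
```

In this toy case the pool contains the decisive document, so the ceiling is at its maximum, but the top-2 cut misses it and the score is low. Under these assumptions, that gap points to ordering headroom rather than retrieval headroom, which is the separation the abstract attributes to PROC and %PROC.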
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Biomedical Text Mining and Ontologies
