Towards a rigorous evaluation of RAG systems: the challenge of due diligence

Grégoire Martinon; Alexandra Lorenzo de Brionne; Jérôme Bohard; Antoine Lojou; Damien Hervault; Nicolas J-B. Brunel (ENSIIE; LaMME)

arXiv:2507.21753 · cs.AI · July 30, 2025

TL;DR

This paper presents a rigorous evaluation protocol for Retrieval-Augmented Generation systems, addressing reliability issues like hallucinations in high-stakes applications such as finance and healthcare.

Contribution

It introduces a robust, statistically guaranteed evaluation method combining human and LLM-based annotations, along with a comprehensive dataset for RAG system assessment.

Findings

01 Effective identification of system failures like hallucinations and off-topic responses

02 Enhanced evaluation reliability with statistical guarantees

03 Provision of a comprehensive dataset for future research

Abstract

The rise of generative AI has driven significant advancements in high-risk sectors like healthcare and finance. The Retrieval-Augmented Generation (RAG) architecture, which combines large language models (LLMs) with search engines, is particularly notable for its ability to generate responses from document corpora. Despite its potential, the reliability of RAG systems in critical contexts remains a concern, with issues such as hallucinations persisting. This study evaluates a RAG system used in due diligence for an investment fund. We propose a robust evaluation protocol combining human annotations and LLM-Judge annotations to identify system failures such as hallucinations, off-topic responses, failed citations, and abstentions. Inspired by the Prediction Powered Inference (PPI) method, we achieve precise performance measurements with statistical guarantees. We provide a comprehensive dataset for further…
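The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea, assuming the standard PPI estimator for a mean: the LLM-Judge scores on a large unlabeled set give a cheap but possibly biased estimate, and a small human-annotated subset is used to correct that bias. The function name `ppi_mean_ci` and its argument names are hypothetical, and this is not claimed to be the authors' implementation.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(human_labels, judge_on_labeled, judge_on_unlabeled, alpha=0.05):
    """Sketch of a PPI-style estimate of a mean rate (e.g. a failure rate).

    human_labels:       human annotations (0/1) on the small labeled subset
    judge_on_labeled:   LLM-Judge annotations on the same labeled subset
    judge_on_unlabeled: LLM-Judge annotations on the large unlabeled set
    Returns (estimate, lower, upper) for a (1 - alpha) confidence interval.
    """
    n = len(human_labels)
    big_n = len(judge_on_unlabeled)
    if n != len(judge_on_labeled):
        raise ValueError("human and judge annotations must be paired")
    if n < 2 or big_n < 2:
        raise ValueError("need at least two samples in each set")

    # Rectifier: how far the judge is from the human label on each paired sample.
    rectifier = [y - f for y, f in zip(human_labels, judge_on_labeled)]

    # Judge-only mean on the large set, corrected by the mean judge error.
    estimate = fmean(judge_on_unlabeled) + fmean(rectifier)

    # The two terms come from independent samples, so their variances add.
    std_err = sqrt(variance(judge_on_unlabeled) / big_n + variance(rectifier) / n)

    # Normal-approximation interval.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err
```

When the judge agrees closely with the human annotators, the rectifier has low variance and the interval is driven mostly by the size of the unlabeled set, which is what makes it tighter than an interval built from the human labels alone.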
