GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation

Ionut-Teodor Sorodoc; Leonardo F. R. Ribeiro; Rexhina Blloshmi; Christopher Davis; Adrià de Gispert

arXiv:2506.07671·cs.CL·June 10, 2025


TL;DR

GaRAGe is a comprehensive benchmark with detailed grounding annotations designed to evaluate how well large language models can identify and utilize relevant information in long-form question answering across diverse topics and complexities.

Contribution

This paper introduces GaRAGe, a novel benchmark with human-curated annotations for fine-grained evaluation of LLM grounding capabilities in RAG tasks.

Findings

01

Models often over-summarize rather than ground answers on relevant passages.

02

Models rarely deflect when they should: the true positive rate for deflection is low when no relevant information is available.

03

Performance drops significantly with time-sensitive questions and sparse sources.
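
The deflection finding above refers to a true positive rate, which can be made concrete with a short sketch. This is only an illustration of the standard definition TP / (TP + FN), not the paper's evaluation code. The function name and the boolean label format are assumptions introduced here.

```python
from typing import Optional, Sequence


def deflection_true_positive_rate(
    should_deflect: Sequence[bool], did_deflect: Sequence[bool]
) -> Optional[float]:
    """Standard true positive rate, TP / (TP + FN), for deflection.

    should_deflect[i]: True if question i has no relevant grounding
                       (hypothetical gold label).
    did_deflect[i]:    True if the model gave a deflective answer
                       (hypothetical prediction label).
    """
    if len(should_deflect) != len(did_deflect):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(should_deflect, did_deflect))
    # True positives: the model deflected when it should have.
    tp = sum(1 for gold, pred in pairs if gold and pred)
    # False negatives: the model answered when it should have deflected.
    fn = sum(1 for gold, pred in pairs if gold and not pred)

    if tp + fn == 0:
        return None  # no question required deflection, so the rate is undefined
    return tp / (tp + fn)
```

Questions that did not require deflection are ignored by this rate; they would matter for a complementary measure such as the false positive rate.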

Abstract

We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM's ability to identify only the relevant information necessary to compose a response, or provide a deflective response when there is insufficient information. Evaluations of multiple state-of-the-art LLMs on GaRAGe show that the models tend to over-summarise rather than (a) ground their answers strictly on the annotated relevant passages (reaching at most a Relevance-Aware Factuality Score of…

Code & Models

Datasets

AmazonScience/GaRAGe
dataset · 78 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods