Controlled Retrieval-augmented Context Evaluation for Long-form RAG

Jia-Huei Ju; Suzan Verberne; Maarten de Rijke; Andrew Yates

arXiv:2506.20051·cs.IR·January 13, 2026

TL;DR

This paper introduces CRUX, a new evaluation framework for assessing retrieval-augmented contexts in long-form RAG tasks, emphasizing the importance of context quality over relevance metrics, and reveals significant room for improvement in current retrieval methods.

Contribution

The paper proposes CRUX, a novel human-centered, question-based evaluation framework that directly measures the quality of retrieval-augmented contexts in long-form generation tasks.
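As a rough illustration of what a question-based context score could look like, the sketch below computes the fraction of a question set that a retrieved context can answer. It is not the paper's metric: the names `question_coverage` and `make_keyword_judge` are hypothetical, and the keyword-matching judge is a toy stand-in for whatever answerability judgment the framework actually uses.

```python
from typing import Callable, Dict, Sequence


def question_coverage(
    context: str,
    questions: Sequence[str],
    is_answerable: Callable[[str, str], bool],
) -> float:
    """Return the fraction of questions the context can answer.

    Hypothetical scoring function for illustration only.
    """
    if not questions:
        return 0.0
    answered = sum(1 for q in questions if is_answerable(context, q))
    return answered / len(questions)


def make_keyword_judge(keywords: Dict[str, str]) -> Callable[[str, str], bool]:
    """Build a toy judge: a question counts as answerable when its
    associated keyword appears in the context (an assumption, not a
    real answerability model)."""
    def judge(context: str, question: str) -> bool:
        return keywords[question].lower() in context.lower()
    return judge


# Toy data: two questions, only the first is covered by the context.
keywords = {
    "What controls the information scope?": "summaries",
    "What is the retrieval latency?": "latency",
}
judge = make_keyword_judge(keywords)
context = "CRUX uses human-written summaries to control the information scope."
score = question_coverage(context, list(keywords), judge)
```

Because the judge is passed in as a function, the same scoring loop could be reused with a stronger answerability check without changing the coverage computation.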

Findings

01

CRUX provides more reflective and diagnostic evaluation of retrieval quality.

02

Current retrieval methods show substantial room for improvement.

03

CRUX enables fine-grained assessment of retrieval relevance in long-form RAG.

Abstract

Retrieval-augmented generation (RAG) enhances large language models by incorporating context retrieved from external knowledge sources. While the effectiveness of the retrieval module is typically evaluated with relevance-based ranking metrics, such metrics may be insufficient to reflect the retrieval's impact on the final RAG result, especially in long-form generation scenarios. We argue that providing a comprehensive retrieval-augmented context is important for long-form RAG tasks like report generation and propose metrics for assessing the context independent of generation. We introduce CRUX, a Controlled Retrieval-aUgmented conteXt evaluation framework designed to directly assess retrieval-augmented contexts. This framework uses human-written summaries to control the information scope of knowledge, enabling us to measure how well the context…

Code & Models

Datasets

DylanJHJ/crux
dataset· 154 dl

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications

Methods: Attention Dropout · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · BERT · BART