LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation

David Carmel; Simone Filice; Guy Horowitz; Yoelle Maarek; Alex Shtoff; Oren Somekh; Ran Tavory

arXiv:2511.14531·cs.CL·November 19, 2025

TL;DR

The paper introduces the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers, each annotated with estimated difficulty and discriminability scores, for systematic evaluation of Retrieval Augmented Generation (RAG) Q&A systems.

Contribution

It presents a benchmark derived from the one used in the SIGIR'2025 LiveRAG Challenge, augmented with ground-truth answers, supporting claims, and per-question difficulty and discriminability scores, to support systematic evaluation of RAG-based Q&A systems.

Findings

1. Questions exhibit high diversity and a wide range of difficulty levels.
2. The dataset effectively differentiates system capabilities.
3. Ground-truth answers and supporting claims enhance evaluation accuracy.

Abstract

With Retrieval Augmented Generation (RAG) becoming more and more prominent in generative AI solutions, there is an emerging need for systematically evaluating their effectiveness. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to…
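To make the difficulty and discriminability scores more concrete, here is a minimal sketch of how such parameters behave in an Item Response Theory model. The abstract shown here is truncated and does not say which IRT variant the authors used, so the two-parameter logistic (2PL) form below is an assumption chosen for illustration. The function name `p_correct` and all parameter values are hypothetical and are not taken from the dataset.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float) -> float:
    """2PL IRT (assumed model): probability that a system with the given
    ability answers an item with the given parameters correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical items: same difficulty, different discriminability.
flat_item = {"difficulty": 0.5, "discrimination": 0.3}
sharp_item = {"difficulty": 0.5, "discrimination": 3.0}

# Hypothetical system abilities on the same latent scale.
for ability in (-1.0, 0.5, 2.0):
    flat = p_correct(ability, **flat_item)
    sharp = p_correct(ability, **sharp_item)
    print(f"ability={ability:+.1f}  flat={flat:.3f}  sharp={sharp:.3f}")
```

Under this assumed model, a system whose ability equals an item's difficulty has a 50% chance of answering correctly, and a higher discrimination value makes that probability change more steeply around the difficulty point, so the item separates weaker and stronger systems more sharply.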

Code & Models

Datasets

LiveRAG/Benchmark
dataset · 154 dl

Taxonomy

Topics: Expert finding and Q&A systems · AI in Service Interactions · Topic Modeling