LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation
David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Alex Shtoff, Oren Somekh, Ran Tavory

TL;DR
The paper introduces LiveRAG, a publicly available synthetic dataset of 895 questions and answers, each annotated with estimated difficulty and discriminability scores, for systematic evaluation of Retrieval Augmented Generation (RAG) systems.
Contribution
It presents a new benchmark dataset with difficulty annotations, derived from the SIGIR'2025 LiveRAG Challenge and augmented with ground-truth answers and supporting claims, to support systematic evaluation of RAG-based Q&A systems.
Findings
Questions exhibit high diversity and a wide range of difficulty levels.
The dataset effectively differentiates system capabilities.
Ground-truth answers and supporting claims enhance evaluation accuracy.
Abstract
With Retrieval Augmented Generation (RAG) becoming more and more prominent in generative AI solutions, there is an emerging need for systematically evaluating their effectiveness. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to…
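To illustrate what difficulty and discriminability scores mean in an Item Response Theory setting, here is a minimal sketch. The abstract above is truncated before it names the model variant, so the two-parameter logistic (2PL) form used here is an assumption, and the function name `p_correct` and all parameter values are hypothetical, not taken from the dataset.

```python
import math


def p_correct(ability, difficulty, discrimination):
    """2PL item response function (assumed model, not confirmed by the source).

    Returns the probability that a system with the given ability answers
    an item correctly. Difficulty shifts the curve along the ability axis;
    discrimination controls how steeply the probability rises around it.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical item parameters, for illustration only.
easy_item = {"difficulty": -1.0, "discrimination": 1.5}
hard_item = {"difficulty": 2.0, "discrimination": 1.5}

for ability in (-1.0, 0.0, 2.0):
    pe = p_correct(ability, **easy_item)
    ph = p_correct(ability, **hard_item)
    print(f"ability={ability:+.1f}  easy={pe:.2f}  hard={ph:.2f}")
```

Under this form, a system whose ability equals an item's difficulty has a 50% chance of answering it correctly, and items with higher discrimination separate systems just below and just above that point more sharply.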
Taxonomy
Topics: Expert finding and Q&A systems · AI in Service Interactions · Topic Modeling
