IRB: Automated Generation of Robust Factuality Benchmarks
Lam Thanh Do, Bhagyashree Taleka, Hozaifa Ammar Bhutta, Vikram Sharma Mailthody, Kevin Chen-Chuan Chang, Wen-mei Hwu

TL;DR
IRB is an automated framework for generating robust factuality benchmarks for RAG systems. Its benchmarks challenge frontier LLMs in the closed-book setting, and the evaluation suggests that improving retrieval may be more cost-effective than scaling the generator.
Contribution
We introduce IRB, a novel automated benchmark generation framework that enhances robustness and reduces manual effort in evaluating RAG system factuality.
Findings
IRB-generated benchmarks challenge frontier LLMs in closed-book settings.
Reasoning LLMs demonstrate higher reliability in factuality assessments.
Improving retrieval components can be more cost-effective than scaling generators.
Abstract
Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks to evaluate the factuality of RAG systems. IRB employs a structured generation pipeline built on a factual scaffold and an algorithmic scaffold. We use IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Multimodal Machine Learning Applications
