IRB: Automated Generation of Robust Factuality Benchmarks

Lam Thanh Do; Bhagyashree Taleka; Hozaifa Ammar Bhutta; Vikram Sharma Mailthody; Kevin Chen-Chuan Chang; Wen-mei Hwu

arXiv:2602.08070·cs.IR·February 10, 2026

Open Access

TL;DR

IRB is an automated framework for generating robust factuality benchmarks for RAG systems. Its benchmarks prove difficult for frontier LLMs in the closed-book setting, and the results suggest that improving retrieval quality can be more cost-effective than scaling the generator.

Contribution

We introduce IRB, a novel automated benchmark generation framework that enhances robustness and reduces manual effort in evaluating RAG system factuality.

Findings

01 IRB-generated benchmarks challenge frontier LLMs in closed-book settings.

02 Reasoning LLMs demonstrate higher reliability in factuality assessments.

03 Improving retrieval components can be more cost-effective than scaling generators.

Abstract

Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks to evaluate the factuality of RAG systems. IRB employs a structured generation pipeline built on two components, a factual scaffold and an algorithmic scaffold. We use IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.
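The distinction between the closed-book setting and a retrieval-augmented setting can be illustrated with a minimal sketch. This is not the IRB pipeline or the authors' evaluation code: the items, the corpus, the exact-match scoring, and the names `toy_generator`, `toy_retriever`, and `accuracy` are all hypothetical, chosen only to show how the same generator might be scored with and without retrieved context.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical benchmark items (invented facts, not taken from IRB).
ITEMS: List[Dict[str, str]] = [
    {"question": "Which fictional city hosts the Zarn archive?", "answer": "Velmora"},
    {"question": "Who founded the fictional Zarn archive?", "answer": "Ila Renn"},
]

# Hypothetical corpus: one supporting passage per question.
CORPUS: Dict[str, str] = {
    ITEMS[0]["question"]: "answer: Velmora",
    ITEMS[1]["question"]: "answer: Ila Renn",
}

# Facts the toy generator "memorized"; it knows only the first one.
MEMORY: Dict[str, str] = {ITEMS[0]["question"]: "Velmora"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def toy_retriever(question: str) -> str:
    """Stand-in retriever: exact lookup, empty string when nothing is found."""
    return CORPUS.get(question, "")


def toy_generator(question: str, context: Optional[str]) -> str:
    """Stand-in generator: reads the answer from context, else falls back to memory."""
    if context and context.startswith("answer: "):
        return context[len("answer: "):]
    return MEMORY.get(question, "unknown")


def accuracy(
    items: List[Dict[str, str]],
    answer_fn: Callable[[str, Optional[str]], str],
    retrieve_fn: Optional[Callable[[str], str]] = None,
) -> float:
    """Exact-match accuracy; closed-book when retrieve_fn is None."""
    if not items:
        return 0.0
    correct = 0
    for item in items:
        context = retrieve_fn(item["question"]) if retrieve_fn else None
        prediction = answer_fn(item["question"], context)
        correct += int(normalize(prediction) == normalize(item["answer"]))
    return correct / len(items)


closed_book = accuracy(ITEMS, toy_generator)
with_retrieval = accuracy(ITEMS, toy_generator, toy_retriever)
print(f"closed-book: {closed_book:.2f}, with retrieval: {with_retrieval:.2f}")
```

In this toy setup the generator is held fixed and only the retriever changes, which is the kind of comparison that would be needed to separate gains from the retrieval component from gains from the generator.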

Taxonomy

Topics: Topic Modeling · Information Retrieval and Search Behavior · Multimodal Machine Learning Applications