Generating Leakage-Free Benchmarks for Robust RAG Evaluation
Jiayi Liu, Jiaxing Zhang, Bowen Jin, Jennifer Neville

TL;DR
This paper presents SeedRG, a semi-synthetic benchmark generation pipeline that reduces knowledge leakage in RAG evaluation by creating structurally similar, novel questions that are unlikely to be in the model's parametric memory.
Contribution
SeedRG introduces a novel method for generating robust RAG benchmarks that mitigate knowledge leakage and benchmark aging through reasoning graph extraction and entity replacement.
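To make the idea of entity replacement over a reasoning graph concrete, here is a minimal sketch. It is not SeedRG's implementation: the `Edge` type, the `replace_entities` and `same_structure` helpers, the toy triples, and the invented entity names are all assumptions for illustration. The sketch shows only that swapping entities through a one-to-one mapping leaves the relation chain (the reasoning structure) unchanged while the surface facts become novel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One hypothetical reasoning step: head --relation--> tail."""
    head: str
    relation: str
    tail: str


def replace_entities(graph, mapping):
    """Swap entities via `mapping`; relations are left untouched."""
    return [
        Edge(mapping.get(e.head, e.head), e.relation, mapping.get(e.tail, e.tail))
        for e in graph
    ]


def same_structure(a, b):
    """True if both graphs share relations and the same entity-reuse pattern."""
    def signature(graph):
        ids = {}
        sig = []
        for e in graph:
            # Number entities by first appearance so names do not matter.
            h = ids.setdefault(e.head, len(ids))
            t = ids.setdefault(e.tail, len(ids))
            sig.append((h, e.relation, t))
        return sig
    return signature(a) == signature(b)


# Toy two-hop seed graph (assumed example, not taken from the paper).
seed = [
    Edge("Marie Curie", "born_in", "Warsaw"),
    Edge("Warsaw", "capital_of", "Poland"),
]

# Invented replacement entities; the mapping must be one-to-one,
# otherwise distinct entities could collapse and change the structure.
mapping = {"Marie Curie": "Talia Venn", "Warsaw": "Orlestad", "Poland": "Kaveria"}

novel = replace_entities(seed, mapping)
print(same_structure(seed, novel))  # True
```

Because the replacement entities are invented, a question built from `novel` cannot be answered from memorized facts about the seed entities, yet it still requires the same two-hop chain.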
Findings
SeedRG effectively reduces knowledge leakage in RAG benchmarks.
Generated benchmarks maintain reasoning complexity while being novel.
The approach addresses the problem of benchmark aging in RAG evaluation.
Abstract
Retrieval-augmented generation (RAG) is widely used to augment large language models (LLMs) with external knowledge. However, many benchmark datasets designed to test RAG performance contain numerous questions that an LLM can already answer from its parametric memory, which makes evaluation unreliable. We refer to this phenomenon as knowledge leakage: cases where RAG tasks are solvable without retrieval. The issue worsens over time through benchmark aging: as benchmarks are reused for training, their contents are increasingly absorbed into model parameters, making them less effective for evaluating retrieval. We introduce SeedRG, a semi-synthetic benchmark generation pipeline that mitigates knowledge leakage and addresses benchmark aging. Starting from a seed benchmark dataset, SeedRG extracts a reasoning graph from question-context pairs to capture their underlying…
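One way to make the definition of knowledge leakage operational is to measure how often a model answers benchmark questions correctly with no retrieved context. The sketch below is an assumption-laden illustration, not the paper's metric: `leakage_rate`, `normalize`, the containment-based correctness check, and the `stub_model` stand-in for a closed-book LLM call are all hypothetical.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient comparison."""
    return " ".join(text.lower().split())


def leakage_rate(
    examples: Iterable[Tuple[str, str]],
    closed_book_answer: Callable[[str], str],
) -> float:
    """Fraction of (question, gold) pairs answered correctly without retrieval.

    Correctness is approximated by checking that the gold answer appears
    in the model output (an assumed, simplistic criterion).
    """
    examples = list(examples)
    if not examples:
        return 0.0
    leaked = sum(
        1
        for question, gold in examples
        if normalize(gold) in normalize(closed_book_answer(question))
    )
    return leaked / len(examples)


# Hypothetical stand-in for an LLM queried with no retrieved context.
memorized = {"What is the capital of France?": "The capital of France is Paris."}


def stub_model(question: str) -> str:
    return memorized.get(question, "I don't know.")


examples = [
    ("What is the capital of France?", "Paris"),         # answerable from memory
    ("Who founded the town of Orlestad?", "Talia Venn"),  # invented, not memorized
]

print(leakage_rate(examples, stub_model))  # 0.5
```

A high rate on a seed benchmark would indicate that many of its questions do not actually test retrieval; under this reading, a leakage-free benchmark would drive the rate toward zero.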
