Can we Evaluate RAGs with Synthetic Data?

Jonas van Elburg; Peter van der Putten; Maarten Marx

arXiv:2508.11758·cs.CL·October 22, 2025

TL;DR

This paper examines whether synthetic question-answer data generated by large language models can substitute for human-labeled benchmarks when evaluating retrieval-augmented generation (RAG) systems. It finds synthetic data reliable for comparing retriever configurations but not for comparing generator architectures.

Contribution

It provides an empirical assessment of synthetic data as a benchmarking tool for RAG systems, highlighting its strengths and limitations across different configurations.

Findings

01

Synthetic benchmarks reliably rank RAGs by retriever configuration.

02

Synthetic benchmarks do not consistently rank generator architectures.

03

Task mismatch and stylistic bias affect synthetic benchmark reliability.

Abstract

We investigate whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labeled benchmarks when such benchmarks are unavailable. We assess the reliability of synthetic benchmarks across two experiments: one varying retriever parameters while keeping the generator fixed, and another varying the generator with fixed retriever parameters. Across four datasets, two open-domain and two proprietary, we find that synthetic benchmarks reliably rank RAGs that vary in retriever configuration, aligning well with human-labeled benchmark baselines. However, they do not consistently produce reliable RAG rankings when comparing generator architectures. The breakdown possibly arises from a combination of task mismatch between the synthetic and human benchmarks, and stylistic bias favoring certain generators.
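One way to make "reliably rank" concrete is to score the same set of RAG configurations on both benchmarks and measure how well the two orderings agree. The sketch below uses Kendall's tau-a for this. That choice is an assumption, since the abstract does not say which agreement statistic the authors used. The function name `kendall_tau`, the configuration labels, and all scores are hypothetical and made up for illustration; they are not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system_name: score} mappings.

    Returns a value in [-1, 1]: 1 means both benchmarks order every pair
    of systems the same way, -1 means every pair is ordered oppositely.
    Tied pairs count as neither concordant nor discordant.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both benchmarks must score the same systems")
    systems = sorted(scores_a)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")

    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        # The product is positive when both benchmarks agree on which
        # of the two systems is better, negative when they disagree.
        agreement = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1

    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores (NOT from the paper): retriever settings, fixed generator.
human_retriever = {"k=1": 0.52, "k=3": 0.61, "k=5": 0.66, "k=10": 0.64}
synthetic_retriever = {"k=1": 0.70, "k=3": 0.78, "k=5": 0.83, "k=10": 0.80}

# Hypothetical scores (NOT from the paper): generators, fixed retriever.
human_generator = {"gen_a": 0.60, "gen_b": 0.55, "gen_c": 0.50}
synthetic_generator = {"gen_a": 0.70, "gen_b": 0.80, "gen_c": 0.75}

retriever_tau = kendall_tau(human_retriever, synthetic_retriever)
generator_tau = kendall_tau(human_generator, synthetic_generator)
print(f"retriever ranking agreement: {retriever_tau:.2f}")
print(f"generator ranking agreement: {generator_tau:.2f}")
```

In this toy data the absolute scores differ between benchmarks in both experiments, but only the generator comparison changes the ordering. A rank statistic captures exactly that distinction: a synthetic benchmark can be useful for selecting a configuration even when its raw scores are inflated, as long as the ordering is preserved.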
