PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

Zhehao Tan; Yihan Jiao; Dan Yang; Lei Liu; Jie Feng; Duolin Sun; Yue Shen; Jian Wang; Peng Wei; Jinjie Gu

arXiv:2507.22927·cs.CL·August 1, 2025

TL;DR

The PRGB Benchmark introduces a detailed, multi-level evaluation framework for RAG systems, focusing on LLM-specific capabilities and the role of external knowledge, to improve reliability and efficiency.

Contribution

The paper presents a placeholder-based evaluation approach and a benchmark for assessing LLMs in RAG systems at multiple levels of granularity.

Findings

01. Current LLMs show limitations in error resilience and context faithfulness in RAG.

02. The benchmark reveals specific weaknesses in representative LLMs' generation capabilities.

03. The framework enables systematic, reproducible evaluation of RAG system components.

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce Placeholder-RAG-Benchmark, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs' roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the…
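
One way to picture a placeholder-based evaluation is the minimal Python sketch below. It assumes that placeholders in a document template are filled with synthetic values, so that a correct answer can only come from the retrieved text and not from what the model memorized. The template strings, the function names `fill`, `build_sample`, and `is_faithful`, and the substring check are all hypothetical illustrations, not the benchmark's actual code or scoring rule.

```python
import re

# Hypothetical templates: {NAME} placeholders stand in for concrete facts.
TEMPLATE_DOC = "{PERSON} joined {ORG} in {YEAR}."
TEMPLATE_QUERY = "Which organization did {PERSON} join?"


def fill(template, values):
    """Replace every {NAME} placeholder with values[NAME].

    Raises KeyError if the template uses a name that has no value.
    """
    return re.sub(r"\{(\w+)\}", lambda m: values[m.group(1)], template)


def build_sample(values):
    """Build one evaluation sample from synthetic placeholder values."""
    return {
        "query": fill(TEMPLATE_QUERY, values),
        "document": fill(TEMPLATE_DOC, values),
        "answer": values["ORG"],  # gold answer is the substituted value
    }


def is_faithful(response, sample):
    """Assumed scoring rule: the response must mention the gold answer."""
    return sample["answer"].lower() in response.lower()


# Made-up values that a model is unlikely to know in advance.
sample = build_sample({"PERSON": "Ada Vell", "ORG": "Norwick Labs", "YEAR": "2031"})
print(sample["document"])
print(is_faithful("She joined Norwick Labs.", sample))
```

Because the substituted values are invented, a response that names the right organization suggests the model used the supplied document; swapping in different values for the same template would give a second sample against which that behavior can be rechecked.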

Code & Models

Datasets

AQ-MedAI/PRGB (dataset · 4 dl)

Taxonomy

Topics: Algorithms and Data Compression