Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

Wenxuan Shen; Mingjia Wang; Yaochen Wang; Dongping Chen; Junjie Yang; Yao Wan; Weiwei Lin

arXiv:2508.03644 · cs.CL · August 6, 2025

TL;DR

This paper introduces Double-Bench, a comprehensive, multilingual, multimodal evaluation system for document RAG systems, addressing current evaluation limitations and revealing key insights into model performance and over-confidence issues.

Contribution

The paper presents Double-Bench, a large-scale, fine-grained evaluation framework for document RAG systems, incorporating diverse data and human-verified queries to improve assessment accuracy.

Findings

01 Embedding models are converging in performance.

02 Current RAG frameworks often overestimate evidence support.

03 Stronger retrieval models are needed for better document understanding.

Abstract

Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on a specific part of the document RAG system and use synthetic data with incomplete ground truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that produces a fine-grained assessment of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic update support to address potential data contamination issues. Queries are grounded in exhaustively scanned…

Code & Models

Datasets

Episoode/Double-Bench
dataset· 240 dl
