MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

María Andrea Cruz Blandón; Jayasimha Talur; Bruno Charron; Dong Liu; Saab Mansour; Marcello Federico

arXiv:2502.17163·cs.CL·July 22, 2025

TL;DR

MEMERAG is a multilingual benchmark for evaluating retrieval augmented generation systems, using native-language data and expert annotations to better reflect cultural nuances and improve automatic evaluation methods.

Contribution

It introduces a multilingual, end-to-end meta-evaluation benchmark based on native-language data, enhancing the assessment of RAG systems across diverse languages.

Findings

01 High inter-annotator agreement in faithfulness and relevance assessments

02 Multilingual LLMs show varying performance across languages

03 Benchmark effectively identifies improvements from advanced prompting techniques

Abstract

Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator…
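The idea of checking an automatic evaluator against human judgement can be illustrated with a small agreement computation. The sketch below uses Cohen's kappa, one common chance-corrected agreement statistic; the abstract as shown here does not say which statistic the authors use, so the metric, the function name `cohen_kappa`, and the toy "faithful"/"unfaithful" labels are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences (Cohen's kappa)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: human annotations vs. an automatic judge's verdicts.
human = ["faithful", "faithful", "unfaithful", "faithful", "unfaithful", "unfaithful"]
judge = ["faithful", "unfaithful", "unfaithful", "faithful", "unfaithful", "faithful"]
kappa = cohen_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

The same function applies to two human annotators (inter-annotator agreement) or to a human and an LLM judge (meta-evaluation); only the source of the second label sequence changes.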

Code & Models

Repositories

amazon-science/memerag (Official)

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Byte Pair Encoding · WordPiece · Dense Connections · Attention Dropout · Residual Connection