MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico

TL;DR
MEMERAG is a multilingual benchmark for evaluating retrieval augmented generation systems, using native-language data and expert annotations to better reflect cultural nuances and improve automatic evaluation methods.
Contribution
The paper introduces a multilingual, end-to-end meta-evaluation benchmark built on native-language data, intended to improve the assessment of RAG systems across diverse languages.
Findings
High inter-annotator agreement in faithfulness and relevance assessments
Multilingual LLMs show varying performance across languages
Benchmark effectively identifies improvements from advanced prompting techniques
Abstract
Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator…
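As a rough illustration of what meta-evaluation involves, the sketch below scores a hypothetical automatic evaluator against human faithfulness labels, separately per language. Balanced accuracy is used here only as one plausible agreement measure; the metric choice, the function names, and the toy labels are all assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(human, predicted):
    """Mean of per-class recall for binary labels (1 = faithful, 0 = not)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, h in enumerate(human) if h == cls]
        if not idx:
            continue  # skip a class that never appears in the human labels
        hits = sum(1 for i in idx if predicted[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def per_language_scores(records):
    """records: iterable of (language, human_label, evaluator_label)."""
    by_lang = defaultdict(lambda: ([], []))
    for lang, human, pred in records:
        by_lang[lang][0].append(human)
        by_lang[lang][1].append(pred)
    return {lang: balanced_accuracy(h, p) for lang, (h, p) in by_lang.items()}


# Toy, made-up labels purely for illustration (not data from the benchmark).
records = [
    ("en", 1, 1), ("en", 1, 1), ("en", 0, 0), ("en", 0, 1),
    ("de", 1, 0), ("de", 1, 0), ("de", 0, 0), ("de", 0, 0),
]
scores = per_language_scores(records)
```

Reporting the score per language rather than pooled makes it visible when an evaluator agrees with annotators in one language but not in another.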
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Byte Pair Encoding · WordPiece · Dense Connections · Attention Dropout · Residual Connection
