MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

Nandan Thakur; Suleman Kazi; Ge Luo; Jimmy Lin; Amin Ahmad

arXiv:2410.13716·cs.CL·April 1, 2025

TL;DR

MIRAGE-Bench introduces a multilingual, arena-based benchmark for retrieval-augmented generation systems that combines heuristic metrics with large language model judgments, enabling efficient and reliable evaluation across 18 languages.

Contribution

The paper presents a novel surrogate judge trained on heuristic metrics to predict LLM-based judgments, reducing reliance on expensive LLM evaluations in multilingual RAG benchmarking.
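The surrogate-judge idea can be illustrated with a minimal sketch: fit a model that maps heuristic metric values to an LLM judge's scores, then use the fitted model in place of the judge. Everything below is an assumption made for illustration, including the three unnamed features, the synthetic scores, and the choice of ordinary least squares; the summary above does not say which model or features the paper uses.

```python
import numpy as np

def fit_surrogate(features, judge_scores):
    """Least-squares fit of judge_scores ~ features @ w + b (illustrative model choice)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(X, judge_scores, rcond=None)
    return coef  # weights followed by the bias term

def predict_surrogate(coef, features):
    """Predict judge scores from heuristic features with a fitted surrogate."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ coef

rng = np.random.default_rng(0)
# Hypothetical heuristic features, one row per generated answer.
features = rng.random((200, 3))
# Synthetic stand-in for LLM judge scores: a hidden linear rule plus small noise.
true_w = np.array([0.5, 0.3, 0.2])
judge_scores = features @ true_w + 0.01 * rng.standard_normal(200)

# Train on 150 judged answers, then check the surrogate on 50 held-out answers.
coef = fit_surrogate(features[:150], judge_scores[:150])
pred = predict_surrogate(coef, features[150:])
mae = float(np.mean(np.abs(pred - judge_scores[150:])))
print(f"held-out mean absolute error: {mae:.4f}")
```

Once fitted, such a surrogate only needs the cheap heuristic features at evaluation time, which is the cost saving the contribution statement describes.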

Findings

1. The surrogate judge agrees closely with GPT-4o evaluations (Kendall Tau = 0.909).
2. Proprietary and large open-source LLMs perform best on MIRAGE-Bench.
3. The benchmark covers 18 languages and evaluates 19 multilingual LLMs.
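
Kendall Tau, the statistic quoted in the findings, measures how similarly two score lists order the same items. The sketch below computes the tau-a variant in plain Python; the five pairs of scores are invented for illustration and are not results from the paper.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical leaderboard scores for five systems (illustrative only).
judge_scores = [0.91, 0.78, 0.66, 0.52, 0.40]
surrogate_scores = [0.88, 0.80, 0.50, 0.61, 0.35]
tau = kendall_tau(judge_scores, surrogate_scores)
print(f"Kendall tau = {tau:.3f}")
```

In this toy example the two lists disagree on one of the ten system pairs, so the value falls below 1; a value of 1 would mean the two leaderboards rank every pair of systems the same way.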

Abstract

Traditional retrieval-augmented generation (RAG) benchmarks evaluate systems using heuristic-based metrics, but these require human preferences as the ground truth for reference. In contrast, arena-based benchmarks, where systems compete against each other, require an expensive large language model (LLM) as a judge for a reliable evaluation. We present a simple, efficient technique to combine the best of both worlds. The idea is to train a surrogate judge that takes heuristic metrics as input and predicts the LLM-as-a-judge output. In our work, we develop MIRAGE-Bench, a synthetic arena-based RAG benchmark for 18 diverse languages on Wikipedia, focused on multilingual answer generation evaluation. It extensively couples both heuristic features and LLM-as-a-judge evaluation. We benchmark 19 multilingual LLMs, and observe a high correlation (Kendall Tau ($\tau$) = 0.909) using our…
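
Arena-style benchmarks need some way to turn pairwise outcomes into a single ranking. One common choice for this is a Bradley-Terry model, sketched below with a standard iterative (minorization-maximization) update. The excerpt above is truncated and does not state how the leaderboard is computed, so treat this as a sketch of the general technique on made-up win counts rather than the benchmark's own procedure.

```python
def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is how many times system i beat system j. Assumes every
    system has at least one win and at least one loss, so the estimate exists.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        p = [v / scale for v in new_p]  # normalize so strengths sum to 1
    return p

# Hypothetical head-to-head results for three systems, 10 comparisons per pair.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: strengths[i], reverse=True)
print("ranking (best first):", ranking)
```

Under this model the probability that system i beats system j is strengths[i] / (strengths[i] + strengths[j]), so the fitted strengths give both a ranking and a margin between systems.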

Code & Models

Repositories

vectara/mirage-bench (Official)

Taxonomy

Topics: Algorithms and Data Compression

Methods: Linear Layer · Byte Pair Encoding · Softmax · Multi-Head Attention · WordPiece · Dropout · Layer Normalization · Adam · Attention Dropout