MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

Seyoung Song; Seogyeong Jeong; Eunsu Kim; Jiho Jin; Dongkwan Kim; Jay Shin; Alice Oh

arXiv:2505.14395 · cs.CL · November 11, 2025

TL;DR

MUG-Eval is a versatile framework that assesses multilingual capabilities of large language models by transforming benchmarks into conversational tasks, enabling resource-efficient evaluation across many languages without language-specific tools.

Contribution

It introduces a language-agnostic, proxy evaluation method for multilingual LLMs that correlates well with existing benchmarks and works across diverse resource settings.

Findings

01. Strong correlation with established benchmarks ($r > 0.75$)
02. Effective across 30 languages, including low-resource ones
03. Independent of language-specific NLP tools or annotated datasets
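
To make the first finding concrete, here is a minimal sketch of how a correlation between two sets of scores could be computed. It assumes that $r$ denotes Pearson's correlation coefficient, which this page does not state explicitly. The function name `pearson_r` and the toy score lists are hypothetical and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy, made-up scores for four hypothetical models (NOT data from the paper).
proxy_scores = [0.42, 0.55, 0.61, 0.78]
benchmark_scores = [0.40, 0.58, 0.57, 0.81]
print(f"r = {pearson_r(proxy_scores, benchmark_scores):.3f}")
```

A value close to 1 would indicate that the proxy metric ranks and spaces models much as the reference benchmark does.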

Abstract

Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy for successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning…
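
The scoring idea in the abstract, using task success rate as a proxy for generation quality, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TaskResult` record, the `success_rate_by_language` function, and the sample outcomes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one conversational task (hypothetical record layout)."""
    language: str   # language code such as "ko" or "sw"
    success: bool   # True if the task was completed successfully


def success_rate_by_language(results):
    """Return the fraction of successful tasks for each language."""
    totals = {}
    successes = {}
    for result in results:
        totals[result.language] = totals.get(result.language, 0) + 1
        successes[result.language] = (
            successes.get(result.language, 0) + int(result.success)
        )
    return {lang: successes[lang] / totals[lang] for lang in totals}


# Made-up outcomes for illustration only (NOT data from the paper).
sample_results = [
    TaskResult("ko", True),
    TaskResult("ko", False),
    TaskResult("sw", True),
    TaskResult("sw", True),
]
rates = success_rate_by_language(sample_results)
print(rates)
```

Because the score is a plain success fraction, it needs no language-specific tooling and no judge model, which matches the two advantages the abstract claims.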

Code & Models

Repositories

seyoungsong/mugeval (Official)

Taxonomy

Topics: Natural Language Processing Techniques