MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages

Wenhan Han; Yifan Zhang; Zhixun Chen; Binbin Liu; Haobin Lin; Bingni Zhang; Taifeng Wang; Mykola Pechenizkiy; Meng Fang; Yin Zheng

arXiv:2506.19468·cs.CL·June 25, 2025

TL;DR

MuBench is a comprehensive benchmark evaluating 61 languages to assess multilingual LLMs' capabilities, revealing gaps in language coverage and performance disparities, and proposing metrics and training strategies for improvement.

Contribution

Introduction of MuBench, a multilingual benchmark covering 61 languages, and analysis of LLM performance gaps, along with new metrics and training insights for multilingual models.

Findings

01. Significant gaps between claimed and actual language coverage.

02. Performance disparity between English and low-resource languages.

03. Pretraining strategies influence cross-lingual transfer dynamics.

Abstract

Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench's alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. Finally, we pretrain a suite of 1.2B-parameter models…
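
The abstract names Multilingual Consistency (MLC) but this excerpt does not give its definition. As an illustration only, one plausible reading is agreement between a model's answers on the same aligned item in different languages, regardless of whether those answers are correct. The sketch below computes that agreement rate next to plain accuracy. The function names and the agreement-based definition are assumptions made for this example, not the paper's formula.

```python
from itertools import combinations


def accuracy(preds, gold):
    """Fraction of items whose prediction matches the gold label."""
    if len(preds) != len(gold) or not gold:
        raise ValueError("preds and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def pairwise_consistency(preds_a, preds_b):
    """Fraction of aligned items on which two languages get the same prediction.

    Assumed definition for illustration: correctness is ignored, only
    agreement between the two languages counts.
    """
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and aligned")
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def mean_consistency(preds_by_lang):
    """Average pairwise consistency over all unordered language pairs."""
    pairs = list(combinations(sorted(preds_by_lang), 2))
    if not pairs:
        raise ValueError("need predictions for at least two languages")
    total = sum(
        pairwise_consistency(preds_by_lang[x], preds_by_lang[y])
        for x, y in pairs
    )
    return total / len(pairs)


# Hypothetical predictions on four aligned multiple-choice items.
gold = ["A", "B", "C", "D"]
preds_by_lang = {
    "en": ["A", "B", "C", "D"],
    "de": ["A", "B", "C", "A"],
    "sw": ["A", "C", "C", "A"],
}

per_language_accuracy = {
    lang: accuracy(preds, gold) for lang, preds in preds_by_lang.items()
}
overall_consistency = mean_consistency(preds_by_lang)
```

Reporting the two numbers side by side separates two failure modes: a model can be accurate in English yet answer the same question differently in another language, which accuracy averaged over languages does not show. This kind of comparison depends on the items being aligned across languages, which is the property the abstract attributes to MuBench.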

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Ambitious scope and coverage: 61 languages and multiple task categories represent an impressive effort toward comprehensive multilingual evaluation.
- Detailed translation pipeline: The multi-stage quality control with semantic, purity, and cultural sensitivity checks is well-structured and thorough.
- Cross-lingual alignment and code-switching evaluation: Enables new analyses not possible with existing benchmarks.
- Transparency and openness: Dataset availability on Hugging Face improves r

Weaknesses

- Limited novelty: The contribution is primarily engineering and dataset aggregation, not a clear conceptual or methodological innovation beyond existing multilingual benchmarks (e.g., MMLU, BenchMAX, INCLUDE).
- Benchmark saturation: Given numerous existing multilingual datasets, the incremental improvement offered by MuBench does not clearly justify publication in a top-tier venue like ICLR.
- Evaluation analysis lacks depth: insight into causes or linguistic patterns, error analysis, and

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The benchmark can evaluate language models on more than 60 languages, which can be very useful for those language communities, as long as the translation is accurate and makes sense (more on that below).
- The authors double-check the quality of their translation pipeline with a manual inspection that involved native speakers of 17 languages.
- Each sample in the benchmark is annotated with the topic and sub-topic category, which might be very useful metadata for the future use of this datase

Weaknesses

As a native speaker of a lower-resource language, I find the machine-translated "multilingual" benchmarks somewhat troubling. First of all, translationese is a problem even with human translation and much more so with machine translation. You end up with a very specific unnatural variant of each language that relies on English-like linguistic constructions and that might omit language features not present in English. As a result, such benchmarks give overly optimistic scores to English-centric la

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- MuBench is broad in scope, spanning numerous languages, tasks, and samples.
- The dataset construction pipeline is carefully designed, incorporating checks for semantic consistency, translation purity, and cultural sensitivity. The authors further validate the reliability of the translations through expert evaluation on 34K samples and overlap verification with 100 samples from MMMLU.
- The experiments are conducted to evaluate the multilingual capabilities of various LLMs, revealing how cross

Weaknesses

- The tasks in MuBench are mostly binary and multiple-choice formats, overlooking other important multilingual capabilities such as translation, summarization, and instruction following. This restricts the benchmark's overall applicability and impact.
- Some interpretive statements lack explicit numerical evidence. For instance, claims such as "Babel and Sailor2 demonstrate notable gains in their targeted language groups" or "smaller models often benefit from the presence of English in mixed-lan

Reviewer 04 · Rating 4 · Confidence 5

Strengths

- The paper is well written and easy to follow.
- The authors covered a large number of datasets and translated them into 61 languages spanning high-, medium-, and low-resource languages.
- The experimental setup is clearly explained, and the results are followed up by human evaluation.
- The MuBench data collection pipeline looks thorough and includes many checks.
- The cross-lingual consistency evaluation and the code-switched dataset for measuring performance on code-switched data are great contributions

Weaknesses

- I was a bit skeptical from the beginning about the translation quality, but the fact that it was human-evaluated in 17 languages was reassuring. However, when I checked those 17 languages, most of them are either medium- or high-resource languages, even though low-resource languages form the largest group (26 of the 61 languages). This is a serious flaw in the paper. Ideally, the authors should have picked an equal number of high-, medium-, and low-resource languages for human evaluation. Existing

Code & Models

Datasets

aialt/MuBench
dataset· 755 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling