MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages
Wenhan Han, Yifan Zhang, Zhixun Chen, Binbin Liu, Haobin Lin, Bingni Zhang, Taifeng Wang, Mykola Pechenizkiy, Meng Fang, Yin Zheng

TL;DR
MuBench is a benchmark covering 61 languages for assessing the capabilities of multilingual LLMs. It reveals gaps between claimed and actual language coverage and performance disparities across languages, and it proposes metrics and training strategies for improvement.
Contribution
Introduction of MuBench, a multilingual benchmark covering 61 languages, and analysis of LLM performance gaps, along with new metrics and training insights for multilingual models.
Findings
Significant gaps between claimed and actual language coverage.
Performance disparity between English and low-resource languages.
Pretraining strategies influence cross-lingual transfer dynamics.
Abstract
Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench's alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. Finally, we pretrain a suite of 1.2B-parameter models…
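The abstract names Multilingual Consistency (MLC) as a complement to accuracy but does not define it in the text available here. The sketch below is one plausible reading, not the authors' definition: because the benchmark items are aligned across languages, consistency can be measured as the fraction of items on which a model gives the same answer in two languages, whether or not that answer is correct. The function names and the toy predictions are hypothetical and used only for illustration.

```python
from itertools import combinations


def accuracy(preds, gold):
    """Fraction of items answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def pairwise_consistency(preds_a, preds_b):
    """Fraction of aligned items that receive the same answer in both languages.

    Assumption: index i refers to the same underlying question in each list.
    """
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def consistency_table(preds_by_lang):
    """Pairwise consistency for every unordered pair of languages."""
    return {
        (a, b): pairwise_consistency(preds_by_lang[a], preds_by_lang[b])
        for a, b in combinations(sorted(preds_by_lang), 2)
    }


# Hypothetical toy data: four aligned multiple-choice items, three languages.
gold = ["A", "C", "B", "D"]
preds_by_lang = {
    "en": ["A", "C", "B", "D"],
    "sw": ["A", "B", "B", "A"],
    "th": ["A", "B", "B", "D"],
}

acc = {lang: accuracy(preds, gold) for lang, preds in preds_by_lang.items()}
table = consistency_table(preds_by_lang)

for lang, value in acc.items():
    print(f"accuracy[{lang}] = {value:.2f}")
for pair, value in table.items():
    print(f"consistency{pair} = {value:.2f}")
```

In this toy case the two non-English languages agree with each other on three of four items even though one of those shared answers is wrong, which is the kind of pattern accuracy alone would not show.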
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Ambitious scope and coverage: 61 languages and multiple task categories represent an impressive effort toward comprehensive multilingual evaluation.
- Detailed translation pipeline: The multi-stage quality control with semantic, purity, and cultural sensitivity checks is well-structured and thorough.
- Cross-lingual alignment and code-switching evaluation: Enables new analyses not possible with existing benchmarks.
- Transparency and openness: Dataset availability on Hugging Face improves r
- Limited novelty: The contribution is primarily engineering and dataset aggregation, not a clear conceptual or methodological innovation beyond existing multilingual benchmarks (e.g., MMLU, BenchMAX, INCLUDE).
- Benchmark saturation: Given numerous existing multilingual datasets, the incremental improvement offered by MuBench does not clearly justify publication in a top-tier venue like ICLR.
- Evaluation analysis lacks depth: insight into causes or linguistic patterns, error analysis, and
- The benchmark can evaluate language models on more than 60 languages, which can be very useful for those language communities, as long as the translation is accurate and makes sense (more on that below).
- The authors double-check the quality of their translation pipeline with a manual inspection that involved native speakers of 17 languages.
- Each sample in the benchmark is annotated with the topic and sub-topic category, which might be very useful metadata for the future use of this datase
As a native speaker of a lower-resource language, I find the machine-translated "multilingual" benchmarks somewhat troubling. First of all, translationese is a problem even with human translation and much more so with machine translation. You end up with a very specific, unnatural variant of each language that relies on English-like linguistic constructions and that might omit language features not present in English. As a result, such benchmarks give overly optimistic scores to English-centric la
- MuBench is broad in scope, spanning numerous languages, tasks, and samples.
- The dataset construction pipeline is carefully designed, incorporating checks for semantic consistency, translation purity, and cultural sensitivity. The authors further validate the reliability of the translations through expert evaluation on 34K samples and overlap verification with 100 samples from MMMLU.
- The experiments are conducted to evaluate the multilingual capabilities of various LLMs, revealing how cross
- The tasks in MuBench are mostly binary and multiple-choice formats, overlooking other important multilingual capabilities such as translation, summarization, and instruction following. This restricts the benchmark's overall applicability and impact.
- Some interpretive statements lack explicit numerical evidence. For instance, claims such as "Babel and Sailor2 demonstrate notable gains in their targeted language groups" or "smaller models often benefit from the presence of English in mixed-lan
- The paper is well written and easy to follow.
- The authors covered a large number of datasets and translated them into 61 languages spanning high-, medium-, and low-resource languages.
- The experimental setup is clearly explained, and the results are followed up by human evaluation.
- The MuBench data collection pipeline looks thorough and includes many checks.
- The cross-lingual consistency evaluation and the code-switched dataset for measuring performance on code-switched data are great contributions
- I was a bit skeptical from the beginning about the translation quality, but the fact that it was human-evaluated in 17 languages was reassuring. However, when I checked those 17 languages, most of them are either medium- or high-resource languages (of the 61 languages in total, the largest group, 26, is low-resource). This is a serious flaw in the paper. Ideally the authors should have picked an equal number of high-, medium-, and low-resource languages for human evaluation. Existing
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
