BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

Xu Huang; Wenhao Zhu; Hanxu Hu; Conghui He; Lei Li; Shujian Huang; Fei Yuan

arXiv:2502.07346 · cs.CL · April 22, 2025

TL;DR

BenchMAX is a multilingual evaluation suite that assesses advanced capabilities of large language models across 17 languages (English plus 16 others) and highlights the performance gaps between them.

Contribution

The paper introduces a high-quality multilingual benchmark, annotated by native speakers, that addresses a gap in evaluating instruction following, reasoning, long-context understanding, and code generation across languages.

Findings

01  Performance varies significantly across languages.

02  Scaling model size alone does not close capability gaps.

03  The benchmark reveals specific strengths and weaknesses of LLMs in multilingual settings.

Abstract

Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data is machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting…
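
The abstract describes the benchmark as multi-way parallel, meaning every language is evaluated on the same underlying samples, so scores can be compared directly. The sketch below illustrates that idea only: it computes per-language accuracy over shared sample ids and each language's gap to English. The function names, language codes, and toy results are hypothetical and are not taken from BenchMAX or its repository.

```python
# Illustrative sketch only: the data below is made up and is not from BenchMAX.

def per_language_accuracy(results):
    """Accuracy per language over the sample ids shared by every language.

    `results` maps a language code to {sample_id: is_correct}. Restricting to
    shared ids keeps the comparison multi-way parallel. Assumes at least one
    sample id is common to all languages.
    """
    shared_ids = set.intersection(*(set(r) for r in results.values()))
    return {
        lang: sum(1 for i in shared_ids if r[i]) / len(shared_ids)
        for lang, r in results.items()
    }


def gaps_to_reference(accuracy, reference="en"):
    """Each language's accuracy minus the reference language's accuracy."""
    ref = accuracy[reference]
    return {lang: acc - ref for lang, acc in accuracy.items() if lang != reference}


# Hypothetical toy results for one task in three languages.
toy_results = {
    "en": {"q1": True, "q2": True, "q3": False, "q4": True},
    "zh": {"q1": True, "q2": False, "q3": False, "q4": True},
    "sw": {"q1": True, "q2": False, "q3": False, "q4": False},
}

accuracy = per_language_accuracy(toy_results)
gaps = gaps_to_reference(accuracy)
print(accuracy)
print(gaps)
```

A negative gap means the language scores below English on the same samples. Because the samples are parallel, the difference reflects the language rather than a change in task content.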

Code & Models

Repositories

cone-mt/benchmax (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus