Multi-lingual Functional Evaluation for Large Language Models

Victor Ojewale; Inioluwa Deborah Raji; Suresh Venkatasubramanian

arXiv:2506.20793 · cs.CL · March 13, 2026


TL;DR

This paper introduces multi-lingual functional benchmarks for large language models. Results show that performance and robustness vary significantly across languages and benchmarks, which exposes the limitations of evaluating on static data alone.

Contribution

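The paper builds two multi-lingual functional benchmarks, CL-GSM Symbolic and CL-IFEval, by translating existing English functional benchmark templates into five additional languages (French, Spanish, Hindi, Arabic and Yoruba), giving a more practical assessment of model performance.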
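To make the idea of a translated functional template concrete, here is a minimal sketch in Python. It is not the paper's data or code: the template text, the names in `NAMES`, the value ranges, and the helper functions are all hypothetical, and only three of the six languages are shown. The point it illustrates is that a functional benchmark stores a parameterized template rather than a fixed question, so fresh instances with known answers can be generated in every language.

```python
import random

# Hypothetical templates, written for illustration only (not from the paper).
# Each language gets a translation of the same parameterized question.
TEMPLATES = {
    "en": "{name} has {x} apples and buys {y} more. How many apples does {name} have now?",
    "fr": "{name} a {x} pommes et en achète {y} de plus. Combien de pommes {name} a-t-il maintenant ?",
    "es": "{name} tiene {x} manzanas y compra {y} más. ¿Cuántas manzanas tiene {name} ahora?",
}

# Hypothetical per-language names used to fill the {name} slot.
NAMES = {"en": "Liam", "fr": "Louis", "es": "Mateo"}


def sample_instance(lang, rng):
    """Fill one template with random values and return (question, gold_answer)."""
    x = rng.randint(2, 50)
    y = rng.randint(2, 50)
    question = TEMPLATES[lang].format(name=NAMES[lang], x=x, y=y)
    return question, x + y


def make_parallel_instances(seed):
    """Use the same seed per language so every language gets the same numbers."""
    return {lang: sample_instance(lang, random.Random(seed)) for lang in TEMPLATES}


if __name__ == "__main__":
    for lang, (question, answer) in make_parallel_instances(seed=0).items():
        print(lang, "|", question, "|", answer)
```

Reusing the seed keeps the sampled numbers identical across languages, so any accuracy difference on a given instance reflects the language rather than the arithmetic.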

Findings

01 Static benchmarks often overestimate functional performance, and the size of the gap depends on the benchmark and the language.

02 Model robustness varies significantly between languages.

03 Some languages, such as Arabic and English, perform more consistently than others.
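One way to picture the robustness finding is to score a model on several independently generated instantiation sets per language and compare the spread. The sketch below is an assumption about how such a comparison could be set up, not the paper's metric: the function names, the use of population standard deviation, and the placeholder accuracies (`lang_a`, `lang_b`) are all invented for illustration and are not results from the paper.

```python
from statistics import mean, pstdev


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def robustness_summary(acc_by_lang):
    """Summarize per-language accuracy over several instantiation sets.

    acc_by_lang maps a language code to a list of accuracies, one per
    independently sampled set of template instances.
    """
    return {
        lang: {"mean": mean(accs), "std": pstdev(accs)}
        for lang, accs in acc_by_lang.items()
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only; NOT results from the paper.
    toy = {
        "lang_a": [0.80, 0.80, 0.80],  # same score on every set
        "lang_b": [0.90, 0.70, 0.80],  # same mean, but less stable
    }
    for lang, stats in robustness_summary(toy).items():
        print(lang, stats)
```

In this toy input both languages have the same mean accuracy, yet only `lang_b` has a non-zero spread, which is the kind of difference a single static test set cannot show.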

Abstract

Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval) -- by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e., across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in…


Taxonomy

Topics: Topic Modeling