Multi-lingual Functional Evaluation for Large Language Models
Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian

TL;DR
This paper introduces multi-lingual functional benchmarks for large language models. The results reveal significant variation in performance and robustness across languages and benchmarks, and highlight the limitations of static-data evaluations.
Contribution
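The paper creates two new multi-lingual functional benchmarks, CL-GSM Symbolic and CL-IFEval, by translating existing English functional benchmark templates into five additional languages (French, Spanish, Hindi, Arabic and Yoruba), providing a more practical assessment of model performance than static datasets.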
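To make the idea of a translated functional template concrete, here is a minimal sketch. It assumes a functional benchmark item is a symbolic template whose variables are re-sampled for each instance, with the answer computed from those values, so the same underlying problem can be rendered in several languages. The template text, the variable names and the helpers `sample_instance`, `render` and `build_items` are all hypothetical illustrations, not taken from the paper or its code.

```python
import random

# Hypothetical template (NOT from the paper): one symbolic problem,
# translated into three languages that share the same placeholders.
TEMPLATES = {
    "en": "{name} has {a} apples and buys {b} more. How many apples are there in total?",
    "fr": "{name} a {a} pommes et en achète {b} de plus. Combien de pommes y a-t-il au total ?",
    "es": "{name} tiene {a} manzanas y compra {b} más. ¿Cuántas manzanas hay en total?",
}


def sample_instance(rng):
    """Draw fresh variable values and compute the ground-truth answer."""
    values = {
        "name": rng.choice(["Ada", "Sam", "Lina"]),
        "a": rng.randint(2, 20),
        "b": rng.randint(2, 20),
    }
    answer = values["a"] + values["b"]
    return values, answer


def render(values, lang):
    """Fill the template for one language with the sampled values."""
    return TEMPLATES[lang].format(**values)


def build_items(n, seed=0):
    """Generate n instances; each shares one answer across all languages."""
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        values, answer = sample_instance(rng)
        prompts = {lang: render(values, lang) for lang in TEMPLATES}
        items.append({"values": values, "answer": answer, "prompts": prompts})
    return items


if __name__ == "__main__":
    for item in build_items(2):
        print(item["answer"], item["prompts"])
```

Because every language version of an instance is generated from the same sampled values, a model's answers can be compared across languages on identical underlying problems, and re-sampling yields fresh instances rather than a fixed, static test set.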
Findings
Static benchmarks often overestimate model performance across languages.
Model robustness varies significantly between languages.
Some languages, such as Arabic and English, show consistent performance.
Abstract
Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval) -- by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e. across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in…
Taxonomy
Topics: Topic Modeling
