M-IFEval: Multilingual Instruction-Following Evaluation
Antoine Dussolle, Andrea Cardeña Díaz, Shota Sato, Peter Devine

TL;DR
M-IFEval introduces a multilingual benchmark for evaluating large language models' instruction-following abilities across different languages, highlighting performance variability and the need for diverse linguistic assessments.
Contribution
It extends the existing IFEval benchmark to include French, Japanese, and Spanish, enabling comprehensive multilingual evaluation of LLMs.
Findings
Performance varies significantly across languages and instruction types.
Multilingual benchmarks are crucial for assessing LLMs in diverse cultural contexts.
State-of-the-art LLMs show inconsistent performance across languages.
Abstract
Instruction following is a core capability of modern large language models (LLMs), so evaluating this capability is essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural…
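To make "objective criteria" concrete, here is a minimal Python sketch of how a verifiable instruction might be checked programmatically. It is not the benchmark's actual code: the function names, the scoring helper, and the Spanish "ñ" check are hypothetical illustrations of a general and a language-specific instruction, chosen only to show that each check returns a plain pass/fail without any human or model judge.

```python
from typing import Callable, List

# Hypothetical checkers: each takes a model response and returns True/False.

def no_commas(response: str) -> bool:
    """General instruction: the response must not contain any commas."""
    return "," not in response

def min_words(response: str, n: int) -> bool:
    """General instruction: the response must contain at least n words."""
    return len(response.split()) >= n

def contains_letter(response: str, letter: str, min_count: int = 1) -> bool:
    """Illustrative language-specific instruction (assumed, not from the paper):
    the response must use a given letter at least min_count times."""
    return response.lower().count(letter.lower()) >= min_count

def instruction_accuracy(response: str, checks: List[Callable[[str], bool]]) -> float:
    """Fraction of instructions the response satisfies (0.0 if no checks given)."""
    if not checks:
        return 0.0
    return sum(check(response) for check in checks) / len(checks)

# Example: three instructions applied to a Spanish response.
spanish_checks = [
    no_commas,
    lambda r: min_words(r, 5),
    lambda r: contains_letter(r, "ñ"),
]
good = "El niño pequeño juega en el jardín"
bad = "Hola, mundo"
good_score = instruction_accuracy(good, spanish_checks)
bad_score = instruction_accuracy(bad, spanish_checks)
```

Because every check is a deterministic function of the response text, the same response always receives the same score, which is the property the abstract contrasts with subjective AI or human judgement.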
Taxonomy
Topics: Educational Technology and Assessment
