M-IFEval: Multilingual Instruction-Following Evaluation

Antoine Dussolle; Andrea Cardeña Díaz; Shota Sato; Peter Devine

arXiv:2502.04688·cs.CL·February 10, 2025

TL;DR

M-IFEval introduces a multilingual benchmark for evaluating large language models' instruction-following abilities across different languages, highlighting performance variability and the need for diverse linguistic assessments.

Contribution

It extends the existing IFEval benchmark to include French, Japanese, and Spanish, enabling comprehensive multilingual evaluation of LLMs.

Findings

1. Performance varies significantly across languages and instruction types.

2. Multilingual benchmarks are crucial for assessing LLMs in diverse cultural contexts.

3. State-of-the-art LLMs show inconsistent performance across languages.

Abstract

Instruction following is a core capability of modern large language models (LLMs), so evaluating this capability is essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural…
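To make "objective criteria" concrete, here is a minimal sketch of how an instruction can be verified by a program rather than by a human or AI judge. The function names and the three example instructions (a minimum word count, a forbidden letter, a no-katakana rule) are hypothetical illustrations chosen for this page, not taken from the M-IFEval or IFEval code, and the scoring helper assumes a simple fraction-passed metric.

```python
import re
from typing import List

# Hypothetical checkers for illustration only; these are NOT the
# benchmark's actual instructions or implementation.

def has_at_least_n_words(response: str, n: int) -> bool:
    """General instruction: answer with at least n whitespace-separated words."""
    return len(response.split()) >= n

def avoids_letter(response: str, letter: str) -> bool:
    """General instruction: do not use the given letter (case-insensitive)."""
    return letter.lower() not in response.lower()

def contains_no_katakana(response: str) -> bool:
    """Language-specific instruction (assumed example): use no katakana.

    Matches any code point in the Unicode Katakana block (U+30A0 to U+30FF).
    """
    return re.search(r"[\u30A0-\u30FF]", response) is None

def fraction_passed(results: List[bool]) -> float:
    """Assumed scoring: share of checks that passed (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0

# Usage: each check returns a plain boolean, so no judge model is needed.
checks = [
    has_at_least_n_words("uno dos tres", 3),
    avoids_letter("Bonjour tout", "e"),
    contains_no_katakana("ひらがな"),
    contains_no_katakana("カタカナ"),
]
score = fraction_passed(checks)
```

Because every check is a deterministic function of the response text, two runs over the same outputs give the same score, which is the property the abstract contrasts with subjective judgement.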

Code & Models

Repositories

lightblue-tech/M-IFEval
Official

Models

LiquidAI/LFM2.5-1.2B-JP
model · 2.5k dl · ♡ 142

Datasets

sbintuitions/voicebench-ja
dataset · 53 dl

Videos

M-IFEval: Multilingual Instruction-Following Evaluation

Taxonomy

Topics: Educational Technology and Assessment