Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models

Bo Zeng; Chenyang Lyu; Sinuo Liu; Mingyan Zeng; Minghao Wu; Xuanfan Ni; Tianqi Shi; Yu Zhao; Yefeng Liu; Chenyu Zhu; Ruizhe Li; Jiahui Geng; Qing Li; Yu Tong; Longyue Wang; Weihua Luo; Kaifu Zhang

arXiv:2507.11882 · cs.CL · July 17, 2025

TL;DR

This paper introduces Marco-Bench-MIF, a multilingual benchmark for evaluating large language models' instruction-following abilities across 30 languages, addressing linguistic and cultural variations with a hybrid translation and verification pipeline.

Contribution

It extends existing benchmarks to a multilingual, culturally-aware dataset covering 30 languages, enabling comprehensive evaluation of LLMs' multilingual instruction-following capabilities.

Findings

01  25-35% accuracy gap between high- and low-resource languages

02  Model scale impacts performance by 45-60%

03  Machine-translated data underestimates accuracy by 7-22%

Abstract

Instruction-following has become a key capability to evaluate in Large Language Models (LLMs). However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine-translated into other languages, which limits their applicability in multilingual contexts. In this paper, we present a carefully curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through comprehensive evaluation of 20+ LLMs on Marco-Bench-MIF, we found that: (1) 25-35% accuracy gap between high/low-resource…
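The linguistic-constraint issue mentioned in the abstract (capitalization requirements that do not carry over to Chinese) can be illustrated with a small sketch. This is not the authors' pipeline: the function names `has_case_distinction` and `check_all_uppercase` are hypothetical, and returning `None` for "constraint not applicable" is an assumption made here for illustration. It only shows why an English-centric "respond in all capital letters" check needs adapting for scripts without letter case.

```python
from typing import Optional


def has_case_distinction(text: str) -> bool:
    """Return True if the text contains at least one cased character.

    Scripts such as Chinese have no upper/lower case, so lowercasing and
    uppercasing such text yields the same string.
    """
    return text.lower() != text.upper()


def check_all_uppercase(response: str) -> Optional[bool]:
    """Hypothetical verifier for an 'answer in all capital letters' instruction.

    Returns True or False when the constraint is checkable, and None when
    the response has no cased characters, meaning the constraint is not
    applicable and the prompt would need a localized replacement.
    """
    if not has_case_distinction(response):
        return None  # assumption: signal "not applicable" rather than fail
    # str.isupper() is True when every cased character is uppercase;
    # uncased characters (CJK, digits, punctuation) are ignored.
    return response.isupper()


if __name__ == "__main__":
    for sample in ["HELLO WORLD", "Hello world", "你好世界", "HELLO 世界"]:
        print(repr(sample), check_all_uppercase(sample))
```

A naive check such as `response == response.upper()` would mark any Chinese response as passing, which is one way a purely machine-translated benchmark can misreport instruction-following accuracy.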

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques