mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation
Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri

TL;DR
mHumanEval is a multilingual benchmark that evaluates how well large language models generate code from prompts written in over 200 natural languages, addressing the English-centric scope of earlier benchmarks.
Contribution
The paper introduces a multilingual benchmark for code generation, with expert translations for 15 languages and an analysis of state-of-the-art models' cross-lingual capabilities.
Findings
Supports prompts in over 200 languages
Includes expert translations for 15 languages
Provides insights into multilingual code generation performance
Abstract
Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality…
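To make the evaluation setup concrete, here is a minimal sketch of how a HumanEval-style task is typically scored: the prompt is a function signature plus a docstring, the model supplies the body, and assertion-based tests decide pass or fail. It assumes a multilingual variant differs only in the natural language of the docstring. The task, the Spanish docstring, the stand-in completions, and the `passes` helper are all hypothetical illustrations, not items or code from mHumanEval.

```python
# Hypothetical task in a HumanEval-like layout: prompt (signature + docstring)
# and assertion-based tests. The Spanish docstring is an illustrative
# translation written for this sketch, not taken from the benchmark.
PROMPT = '''def add(x: int, y: int) -> int:
    """Suma dos números x e y."""
'''

# Stand-in for a model completion (function body only).
COMPLETION = "    return x + y\n"

TESTS = '''def check(candidate):
    assert candidate(2, 3) == 5
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True if prompt + completion satisfies every assertion in tests."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # defines the candidate function
        exec(tests, namespace)                # defines check()
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False


result_ok = passes(PROMPT, COMPLETION, TESTS, "add")
result_bad = passes(PROMPT, "    return x - y\n", TESTS, "add")
print(result_ok, result_bad)
```

A real harness would run untrusted completions in a sandbox with timeouts; plain `exec` is used here only to keep the sketch short. The example also shows why a small test set can overestimate performance: a wrong completion passes whenever the few assertions happen not to exercise its bug.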
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus
