mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Nishat Raihan; Antonios Anastasopoulos; Marcos Zampieri

arXiv:2410.15037 · cs.CL · May 19, 2025 · 2 citations

TL;DR

mHumanEval is a multilingual benchmark with prompts in over 200 natural languages. It is designed to evaluate how well large language models generate code from prompts written in diverse languages, addressing the limitations of earlier English-centric benchmarks.

Contribution

The paper introduces a multilingual benchmark for code generation, with expert translations for 15 languages and an analysis of state-of-the-art (SOTA) models' cross-lingual capabilities.

Findings

01 Supports prompts in over 200 natural languages

02 Includes expert translations for 15 languages

03 Analyzes the multilingual code generation performance of SOTA models

Abstract

Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality…
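
To make the evaluation setup concrete, the following is a minimal sketch of how a HumanEval-style task is typically scored: the model's completion is appended to the prompt and executed against the task's unit tests. The record below is hypothetical; its Spanish docstring only stands in for a translated prompt and is not taken from mHumanEval. The field names (`prompt`, `entry_point`, `test`) follow the original HumanEval layout and are assumed, not confirmed, to carry over to this benchmark.

```python
# Hypothetical task record in the HumanEval layout (assumed field names).
# The Spanish docstring is an illustrative stand-in for a translated prompt.
task = {
    "prompt": (
        "def add(a, b):\n"
        '    """Devuelve la suma de a y b."""\n'
    ),
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}


def passes(task, completion):
    """Return True if prompt + completion passes the task's unit tests."""
    namespace = {}
    try:
        # Define the candidate function, then the check() harness, then run it.
        exec(task["prompt"] + completion, namespace)
        exec(task["test"], namespace)
        namespace["check"](namespace[task["entry_point"]])
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


good = passes(task, "    return a + b\n")  # a correct completion
bad = passes(task, "    return a - b\n")   # a wrong completion
```

This sketch runs completions with plain `exec` and no isolation, so it should only be used on trusted code; a real harness would execute model output in a sandbox with a timeout.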

Code & Models

Repositories

mraihan-gmu/mHumanEval-Benchmark (official)

Datasets

md-nishat-008/mHumanEval-Benchmark
dataset · 88k downloads

Videos

mHumanEval - A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Taxonomy

Topics: Natural Language Processing Techniques