McEval: Massively Multilingual Code Evaluation

Linzheng Chai; Shukai Liu; Jian Yang; Yuwei Yin; Ke Jin; Jiaheng Liu; Tao Sun; Ge Zhang; Changyu Ren; Hongcheng Guo; Zekun Wang; Boyang Wang; Xianjie Wu; Bing Wang; Tongliang Li; Liqun Yang; Sufeng Duan; Zhoujun Li

arXiv:2406.07436·cs.PL·June 12, 2024·1 cites

TL;DR

McEval is a comprehensive multilingual code benchmark covering 40 programming languages with 16K test samples, designed to evaluate and improve the capabilities of code large language models across diverse languages.

Contribution

The paper introduces McEval, a massively multilingual code evaluation benchmark with curated instruction data and a multilingual code generator, advancing multilingual code understanding research.

Findings

01 Open-source models lag behind GPT-series in multilingual code tasks.

02 McEval reveals significant challenges in multilingual code understanding.

03 Multilingual instruction corpora improve code model performance.

Abstract

Code large language models (LLMs) have shown remarkable advances in code understanding, completion, and generation tasks. Programming benchmarks, comprised of a selection of code challenges and corresponding test cases, serve as a standard to evaluate the capability of different LLMs in such tasks. However, most existing benchmarks primarily focus on Python and are still restricted to a limited number of languages, where other languages are translated from the Python samples (e.g., MultiPL-E), degrading the data diversity. To further facilitate the research of code LLMs, we propose a massively multilingual code benchmark covering 40 programming languages (McEval) with 16K test samples, which substantially pushes the limits of code LLMs in multilingual scenarios. The benchmark contains challenging code completion, understanding, and generation evaluation tasks with finely curated massively…
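
The abstract describes benchmarks as code challenges paired with test cases. As a rough illustration of that idea only, the Python sketch below scores a candidate solution against its test cases and aggregates a pass rate per language. It is not McEval's evaluation harness: the function names, the (args, expected) test-case layout, and the pass-rate aggregation are all assumptions made for this example, and a real harness would run generated code in a sandbox for each target language.

```python
from collections import defaultdict


def passes_all_tests(candidate, test_cases):
    """Return True only if the candidate satisfies every test case.

    `candidate` is a Python callable standing in for model-generated code.
    `test_cases` is a list of (args, expected) pairs (hypothetical layout).
    """
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crashing solution counts as a failed sample.
            return False
    return True


def pass_rate_by_language(results):
    """Aggregate (language, passed) pairs into a per-language pass rate."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        if passed:
            passes[language] += 1
    return {language: passes[language] / totals[language] for language in totals}


if __name__ == "__main__":
    # Toy data: one correct and one incorrect "generated" solution.
    add_tests = [((1, 2), 3), ((0, 0), 0)]
    results = [
        ("Python", passes_all_tests(lambda a, b: a + b, add_tests)),
        ("Python", passes_all_tests(lambda a, b: a - b, add_tests)),
    ]
    print(pass_rate_by_language(results))
```

Reporting the rate per language rather than as a single pooled number keeps weaknesses in less common languages visible, which is the kind of comparison a multilingual benchmark is meant to support.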

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 8·Confidence 5

Strengths

The paper is well-written and is very easy to read and follow. The paper is also a significant leap towards better evaluation for code models. Current models are mostly evaluated on smaller benchmarks like HumanEvalPack or MBPP which have very few languages (<10) and this limits the ability to effectively evaluate code models. The authors' work is quite significant since it introduces a new benchmark with 40 programming languages which is more than 4x of any previous benchmarks. The paper also

Weaknesses

The instruction corpus is generated using GPT-4 and is thus commercially unusable. I understand that collecting good human-annotated instructions is a significant undertaking and might be infeasible.

Reviewer 02·Rating 6·Confidence 4

Strengths

(1) Extensive study and effort have been spent in generating and documenting this code benchmark. (2) While HumanEval and MBPP are the popular code programming benchmarks today, the code LM community needs additional multilingual programming benchmarks. Hence, this paper addresses a problem that is in demand today.

Weaknesses

GPT models, such as gpt-4-1106-preview, have been used to generate the problem descriptions for the code instruction corpora. That could have influenced the results, helping GPT models lead the benchmark with significant performance margins over other models.

Reviewer 03·Rating 6·Confidence 5

Strengths

1. This is truly a BIG project. The authors have made significant contributions to the field of multilingual code generation.

Weaknesses

1. Although this work makes a significant contribution to benchmarking the multilingual code generation capabilities of LLMs, its main contribution is labor-intensive, with limited technical contributions or insights (MCEVAL-INSTRUCT looks similar to MagiCoder by generating instruction data from code snippets). 2. Is it meaningful to benchmark code LLMs on 40 languages? Can the author elaborate on why it is important to measure the generation capabilities of code LLMs in 40 languages simultaneou

Taxonomy

Topics·Natural Language Processing Techniques