Multi-lingual Evaluation of Code Generation Models

Ben Athiwaratkun; Sanjay Krishna Gouda; Zijian Wang; Xiaopeng Li; Yuchen Tian; Ming Tan; Wasi Uddin Ahmad; Shiqi Wang; Qing Sun; Mingyue Shang; Sujan Kumar Gonugondla; Hantian Ding; Varun Kumar; Nathan Fulton; Arash Farahani; Siddhartha Jain; Robert Giaquinto; Haifeng Qian; Murali Krishna Ramanathan; Ramesh Nallapati; Baishakhi Ray; Parminder Bhatia; Sudipta Sengupta; Dan Roth; Bing Xiang

arXiv:2210.14868 · cs.LG · March 30, 2023 · 28 cites


TL;DR

This paper introduces new multilingual benchmarks for code generation models, enabling evaluation across multiple programming languages and demonstrating models' generalization, translation, and few-shot learning capabilities.

Contribution

The authors present MBXP, Multilingual HumanEval, and MathQA-X datasets, along with a scalable conversion framework for multilingual code evaluation, advancing the assessment of language models' coding abilities.

Findings

01. Multilingual models outperform monolingual models.
02. Few-shot prompting enables learning new languages.
03. Models exhibit zero-shot translation capabilities.

Abstract

We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and find that language models generalize to out-of-domain languages, that multi-lingual models have advantages over mono-lingual ones, that few-shot prompting can teach the model new languages, and that models show zero-shot translation abilities even in mono-lingual settings. Furthermore, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for…
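To make the conversion idea in the abstract concrete, here is a minimal sketch of how a Python prompt and test case might be rewritten for a target language. It is an illustration only, not the authors' framework or the mxeval API: the function names (`snake_to_camel`, `convert_prompt_to_js`, `convert_assert_to_js`), the choice of JavaScript as the target, and the `JSON.stringify` comparison are all assumptions made for this example.

```python
import ast
import json


def snake_to_camel(name: str) -> str:
    """Hypothetical helper: min_cost -> minCost (JavaScript naming style)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def convert_prompt_to_js(func_name: str, params: list[str], description: str) -> str:
    """Render a function-completion prompt as a JavaScript stub with a doc comment."""
    doc = "\n".join(f" * {line}" for line in description.splitlines())
    signature = f"function {snake_to_camel(func_name)}({', '.join(params)}) {{"
    return f"/**\n{doc}\n */\n{signature}\n"


def convert_assert_to_js(py_assert: str) -> str:
    """Rewrite `assert f(args) == expected` as a JavaScript equality check.

    Only literal arguments and literal expected values are handled; anything
    else raises ValueError. Tuples become JavaScript arrays.
    """
    stmt = ast.parse(py_assert).body[0]
    if not isinstance(stmt, ast.Assert) or not isinstance(stmt.test, ast.Compare):
        raise ValueError("expected: assert f(...) == value")
    compare = stmt.test
    call, expected = compare.left, compare.comparators[0]
    if not isinstance(compare.ops[0], ast.Eq) or not isinstance(call, ast.Call):
        raise ValueError("expected: assert f(...) == value")
    if not isinstance(call.func, ast.Name):
        raise ValueError("only plain function calls are supported")

    # Python literals are serialized as JSON, which is also valid JavaScript.
    args = [json.dumps(ast.literal_eval(arg)) for arg in call.args]
    want = json.dumps(ast.literal_eval(expected))
    js_call = f"{snake_to_camel(call.func.id)}({', '.join(args)})"
    return f"console.assert(JSON.stringify({js_call}) === JSON.stringify({want}));"


if __name__ == "__main__":
    print(convert_prompt_to_js("min_cost", ["cost", "m", "n"],
                               "Find the minimum cost path."))
    print(convert_assert_to_js("assert min_cost([[1, 2, 3], [4, 8, 2]], 2, 2) == 8"))
```

Parsing the test case with `ast` rather than string matching keeps the sketch robust to nested lists and quoted strings. A real converter would also need per-language type mapping, which this example leaves out.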


Code & Models

Datasets

AmazonScience/mxeval · dataset · 216 dl

Videos

Multi-lingual Evaluation of Code Generation Models · slideslive

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Machine Learning and Data Classification

Methods: Test