Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

arXiv:2105.09938 · cs.SE · November 10, 2021 · 138 cites

TL;DR

This paper introduces APPS, a comprehensive benchmark with 10,000 Python problems to evaluate and track the progress of machine learning models in code generation from natural language specifications.

Contribution

The paper presents APPS, a new benchmark for code generation that measures models' ability to generate correct Python code from natural language descriptions, including a large problem set and evaluation methodology.

Findings

1. Models such as GPT-Neo pass about 20% of the test cases on introductory problems.

2. The prevalence of syntax errors in generated code decreases exponentially as models improve.

3. Machine learning models are beginning to learn how to code.

Abstract

While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language…
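The test-case evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the names `fraction_passed` and `candidate_sum`, the string-in/string-out interface, and the whitespace-insensitive comparison are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple

# Assumed test-case shape: (input text, expected output text).
TestCase = Tuple[str, str]


def fraction_passed(solution: Callable[[str], str], tests: Sequence[TestCase]) -> float:
    """Return the fraction of test cases a candidate solution passes.

    Outputs are compared after stripping surrounding whitespace
    (an assumption for this sketch). A crash counts as a failure.
    """
    if not tests:
        return 0.0
    passed = 0
    for input_text, expected in tests:
        try:
            actual = solution(input_text)
        except Exception:
            continue  # Runtime error: this test case is not passed.
        if actual.strip() == expected.strip():
            passed += 1
    return passed / len(tests)


def candidate_sum(input_text: str) -> str:
    """Hypothetical model-generated program: sum the integers in the input."""
    return str(sum(int(token) for token in input_text.split()))


# Hypothetical test cases; the last two are expected to fail
# (a wrong expected value and a non-integer input that raises).
tests = [("1 2 3", "6"), ("10 -4", "6\n"), ("7", "8"), ("a b", "0")]
score = fraction_passed(candidate_sum, tests)
```

A real harness would run untrusted generated code in a sandboxed subprocess with time limits instead of calling it in-process; the in-process call here only keeps the sketch short.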

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Machine Learning and Data Classification

Methods: GPT-Neo