Program Synthesis with Large Language Models

Jacob Austin; Augustus Odena; Maxwell Nye; Maarten Bosma; Henryk Michalewski; David Dohan; Ellen Jiang; Carrie Cai; Michael Terry; Quoc Le; Charles Sutton

arXiv:2108.07732 · cs.PL · August 18, 2021 · 28 cites

TL;DR

This paper evaluates large language models' ability to synthesize Python programs from natural language, demonstrating that larger models perform better and that fine-tuning and human feedback significantly improve code generation accuracy.

Contribution

The study analyzes large language models for program synthesis, introducing two new benchmarks (MBPP and MathQA-Python) and examining how model size, fine-tuning, and human feedback affect performance.

Findings

01. Performance scales log-linearly with model size
02. Largest models achieve 59.6% accuracy on MBPP with few-shot learning
03. Fine-tuning improves accuracy by about 10 percentage points
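To make "log-linear scaling" concrete: it means accuracy grows roughly as a + b * log10(N), where N is the parameter count. The sketch below fits those two coefficients by ordinary least squares. The function name `fit_log_linear` and the choice of base-10 logarithm are assumptions made for illustration, not something taken from the paper.

```python
import math


def fit_log_linear(param_counts, accuracies):
    """Least-squares fit of accuracy = a + b * log10(param_count).

    Hypothetical helper for illustration; returns the pair (a, b).
    """
    xs = [math.log10(n) for n in param_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accuracies) / len(accuracies)
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Made-up points chosen only to exercise the function; NOT results from the paper.
a, b = fit_log_linear([1e8, 1e9, 1e10], [10.0, 20.0, 30.0])
```

Under such a fit, b is the gain in accuracy (in percentage points) for every tenfold increase in model size.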

Abstract

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without…
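As a rough illustration of what measuring synthesis ability can look like in practice, the sketch below checks a generated program against a set of test cases and reports the fraction of tasks solved. It assumes each task ships with `assert`-style test statements and that a task counts as solved when at least one sampled program passes all of them; both conventions, and the names `passes_tests` and `fraction_solved`, are assumptions of this sketch rather than details stated on this page.

```python
def passes_tests(candidate_source, test_asserts):
    """Return True if the candidate program satisfies every assert statement.

    WARNING: exec runs arbitrary code. Real evaluations should sandbox this
    and enforce a timeout; neither is done in this sketch.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the candidate's functions
        for test in test_asserts:
            exec(test, namespace)          # raises AssertionError on failure
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as failure.
        return False
    return True


def fraction_solved(samples_per_task, tests_per_task):
    """Fraction of tasks for which at least one sample passes all tests."""
    solved = sum(
        any(passes_tests(sample, tests) for sample in samples)
        for samples, tests in zip(samples_per_task, tests_per_task)
    )
    return solved / len(tests_per_task)


# Toy task invented for illustration (not taken from MBPP).
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
score = fraction_solved([[bad, good], [bad]], [tests, tests])
```

Checking by execution rather than by text match means any program with the right behavior is accepted, however it is written.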

Code & Models

Repositories

microsoft/tracecodegen (PyTorch)

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Topic Modeling