Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu; Chunqiu Steven Xia; Yuyao Wang; Lingming Zhang

arXiv:2305.01210·cs.SE·November 1, 2023·172 cites

TL;DR

This paper introduces EvalPlus, an evaluation framework for LLM-generated code that augments existing benchmarks with automatically generated test cases to assess functional correctness more rigorously.

Contribution

The paper presents EvalPlus, a framework that augments code synthesis benchmarks with large numbers of automatically generated test cases, yielding a more accurate assessment of the correctness of LLM-generated code.

Findings

01 EvalPlus extends the HumanEval test cases by 80x to build HumanEval+.

02 Pass@k drops by up to 28.9% when models are re-evaluated on the augmented tests.

03 Prior benchmarks overestimated LLM performance because their test cases were insufficient.
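
The pass@k figures above are commonly computed with the unbiased estimator popularized alongside HumanEval: draw n samples per problem, count the c samples that pass every test, and estimate 1 - C(n-c, k) / C(n, k). This page does not spell out the formula, so the sketch below is an assumption about the metric, and the sample counts in the usage note are invented for illustration. It shows why a stricter test suite can only lower the score: as long as the original tests are kept, extra tests can turn passing samples into failing ones but never the reverse.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: budget of samples the user is allowed to try
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made only of failing samples;
    # it is 0 when fewer than k samples fail, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 6 of 10 samples pass the original tests,
    # but only 4 of 10 still pass once extra tests are added.
    print(pass_at_k(10, 6, 1))
    print(pass_at_k(10, 4, 1))
```

With these made-up counts, pass@1 falls from 0.6 to 0.4 for the problem. A benchmark-level score would average this estimate over all problems.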

Abstract

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies.…
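
The abstract describes growing a test suite from generated inputs and mutations of them. A minimal sketch of that idea follows, and it is not the EvalPlus implementation: the mutation operators, the function names (`mutants`, `grow_inputs`, `find_disagreements`), and the toy "max of a list" task are all hypothetical. The seed inputs stand in for whatever an LLM-based generator might propose, and expected outputs come from running a trusted reference solution on each new input.

```python
from typing import Callable, Iterator, List, Optional

IntList = List[int]


def mutants(xs: IntList) -> Iterator[IntList]:
    """Yield every single-step mutant of xs (a tiny, hypothetical operator set)."""
    for i, x in enumerate(xs):
        yield xs[:i] + xs[i + 1:]            # delete one element
        yield xs[:i] + [-x] + xs[i + 1:]     # flip the sign of one element
        yield xs[:i] + [x + 1] + xs[i + 1:]  # nudge one value
        yield xs[:i] + [x, x] + xs[i + 1:]   # duplicate one element


def grow_inputs(seeds: List[IntList], rounds: int, limit: int) -> List[IntList]:
    """Expand the seeds breadth-first, keeping each distinct input once."""
    pool: List[IntList] = []
    seen = set()
    for s in seeds:
        if tuple(s) not in seen:
            seen.add(tuple(s))
            pool.append(list(s))
    frontier = list(pool)
    for _ in range(rounds):
        next_frontier: List[IntList] = []
        for xs in frontier:
            for m in mutants(xs):
                if tuple(m) in seen:
                    continue
                if len(pool) >= limit:
                    return pool
                seen.add(tuple(m))
                pool.append(m)
                next_frontier.append(m)
        frontier = next_frontier
    return pool


def find_disagreements(
    reference: Callable[[IntList], Optional[int]],
    candidate: Callable[[IntList], Optional[int]],
    inputs: List[IntList],
) -> List[IntList]:
    """Return the inputs on which the candidate differs from the reference."""
    failing = []
    for xs in inputs:
        expected = reference(list(xs))
        try:
            actual = candidate(list(xs))
        except Exception:
            failing.append(xs)  # a crash counts as a failure
            continue
        if actual != expected:
            failing.append(xs)
    return failing


def reference_max(xs: IntList) -> Optional[int]:
    """Stand-in for a trusted ground-truth solution."""
    return max(xs) if xs else None


def candidate_max(xs: IntList) -> Optional[int]:
    """Stand-in for generated code with a subtle bug (starts from 0)."""
    if not xs:
        return None
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


if __name__ == "__main__":
    seeds = [[1, 2, 3], [5]]
    print(find_disagreements(reference_max, candidate_max, seeds))
    pool = grow_inputs(seeds, rounds=1, limit=1000)
    print(len(pool), find_disagreements(reference_max, candidate_max, pool))
```

The buggy candidate passes both seed inputs, yet one round of mutation produces an all-negative list that exposes it. This is the kind of gap between a small hand-written suite and an augmented one that the paper reports.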

Code & Models

Repositories

evalplus/evalplus (official)

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Software Testing and Debugging Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization