Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang

TL;DR
This paper introduces EvalPlus, a rigorous evaluation framework for code generated by LLMs, significantly enhancing existing benchmarks with automated test case generation to better assess true functional correctness.
Contribution
The paper presents EvalPlus, a framework that augments code synthesis benchmarks with large numbers of automatically generated test cases, yielding a more accurate assessment of the functional correctness of LLM-generated code.
Findings
EvalPlus extends the HumanEval test-cases by 80x to build HumanEval+.
Measured pass@k drops significantly on the augmented tests, by up to 28.9%.
Prior benchmarks overestimated LLM performance due to test insufficiency.
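The pass@k drop reported above can be made concrete with a small calculation. The sketch below assumes the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that pass every test; this page does not state which estimator the paper uses. The sample counts in the usage note are hypothetical and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples
    c: number of samples that pass all tests
    k: budget of samples the user is allowed to try
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples, 6 pass the original tests,
# but only 4 still pass once extra tests are added.
base_score = pass_at_k(n=10, c=6, k=1)
plus_score = pass_at_k(n=10, c=4, k=1)
```

With these made-up counts, adding tests can only lower or preserve c, so the score on the augmented suite never exceeds the score on the original one. That is why insufficient tests lead to overestimated performance.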
Abstract
Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies.…
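To illustrate the mutation-based half of the test input generator described in the abstract, here is a minimal sketch, not the EvalPlus implementation. It assumes that seed inputs are already available (hard-coded here; in the paper they come from an LLM) and that a trusted reference solution exists to supply expected outputs. The names `mutate`, `generate_inputs`, `failing_inputs`, and the toy `reference_max` / `buggy_max` functions are all hypothetical.

```python
from typing import Any, Callable, List


def mutate(value: Any) -> List[Any]:
    """Return type-preserving mutants of a value (toy operators, ints and lists only)."""
    if isinstance(value, bool):
        return []  # bools are ints in Python; leave them alone in this sketch
    if isinstance(value, int):
        return [value + 1, value - 1, -value]
    if isinstance(value, list):
        mutants = [value + value[-1:]]  # duplicate the last element
        for i, item in enumerate(value):
            for m in mutate(item):  # mutate one element at a time
                mutants.append(value[:i] + [m] + value[i + 1:])
        return mutants
    return []


def generate_inputs(seeds: List[Any], rounds: int = 2) -> List[Any]:
    """Grow a pool of inputs by repeatedly mutating the newest inputs."""
    pool = list(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for inp in frontier:
            for m in mutate(inp):
                if m not in pool:  # keep only inputs not seen before
                    pool.append(m)
                    next_frontier.append(m)
        frontier = next_frontier
    return pool


def failing_inputs(reference: Callable, candidate: Callable, inputs: List[Any]) -> List[Any]:
    """Differential testing: inputs where the candidate disagrees with the reference."""
    failures = []
    for inp in inputs:
        expected = reference(list(inp))
        try:
            actual = candidate(list(inp))
        except Exception:
            failures.append(inp)
            continue
        if actual != expected:
            failures.append(inp)
    return failures


def reference_max(xs: List[int]) -> int:
    return max(xs)


def buggy_max(xs: List[int]) -> int:
    best = 0  # bug: wrong answer when every element is negative
    for x in xs:
        if x > best:
            best = x
    return best


seeds = [[1, 2, 3], [5, 3]]
seed_failures = failing_inputs(reference_max, buggy_max, seeds)
augmented_failures = failing_inputs(reference_max, buggy_max, generate_inputs(seeds))
```

In this toy setup the buggy function passes both seed inputs, while the mutated pool contains all-negative lists such as `[-5, -3]` that expose the bug. A real generator would also need to filter out inputs that violate the problem's preconditions, which this sketch sidesteps by never shrinking a list.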
Code & Models
- 🤗 RedHatAI/starcoder2-15b-quantized.w8a16 · model · 17 dl
- 🤗 RedHatAI/starcoder2-3b-quantized.w8a16 · model · 16 dl
- 🤗 RedHatAI/starcoder2-7b-quantized.w8a16 · model · 10 dl
- 🤗 RedHatAI/starcoder2-3b-quantized.w8a8 · model · 13 dl
- 🤗 RedHatAI/starcoder2-7b-quantized.w8a8 · model · 9 dl
- 🤗 RedHatAI/starcoder2-15b-quantized.w8a8 · model · 10 dl
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software Testing and Debugging Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization
