ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

Xueying Du; Mingwei Liu; Kaixin Wang; Hanlin Wang; Junwei Liu; Yixuan Chen; Jiayi Feng; Chaofeng Sha; Xin Peng; Yiling Lou

arXiv:2308.01861·cs.CL·August 15, 2023·21 cites

TL;DR

This paper introduces ClassEval, a manually crafted benchmark for evaluating large language models on class-level Python code generation, revealing significant performance gaps compared to method-level tasks and analyzing model capabilities and strategies.

Contribution

The paper presents the first class-level code generation benchmark, ClassEval, and provides a comprehensive study of 11 state-of-the-art LLMs on this challenging task.

Findings

01 All LLMs perform worse on class-level than method-level code generation.

02 GPT-4 and GPT-3.5 outperform other models significantly.

03 Holistic generation works best for GPT-4 and GPT-3.5, while incremental methods suit smaller models.
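The difference between the two strategies in finding 03 can be sketched as follows. This is a minimal illustration, not the paper's actual prompting code: the prompt wording, the function names, and the `generate` callable (a stand-in for any LLM call that maps a prompt string to generated text) are all assumptions made for the example.

```python
from typing import Callable, List


def holistic_generation(skeleton: str, generate: Callable[[str], str]) -> str:
    """Ask the model for the entire class in a single request."""
    prompt = f"Complete the following class:\n{skeleton}"
    return generate(prompt)


def incremental_generation(
    skeleton: str,
    method_names: List[str],
    generate: Callable[[str], str],
) -> str:
    """Ask for one method at a time, feeding earlier output back in."""
    class_so_far = skeleton
    for name in method_names:
        # Each request sees every method generated before it.
        prompt = (
            f"Given the class so far:\n{class_so_far}\n"
            f"Implement the method `{name}`."
        )
        class_so_far += "\n" + generate(prompt)
    return class_so_far
```

In this sketch the holistic strategy issues one request per class, while the incremental strategy issues one request per method and grows the context as it goes.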

Abstract

In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. We first manually construct the first class-level code generation benchmark, ClassEval, which consists of 100 class-level Python code generation tasks and took approximately 500 person-hours to build. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we have the following main findings. First, we find that all existing LLMs perform much worse on class-level code generation than on standalone method-level code generation benchmarks like HumanEval, and that method-level coding ability does not equivalently reflect class-level coding ability among LLMs. Second, we find that GPT-4 and GPT-3.5 still show dominant superiority over other LLMs on class-level code generation, and the second-tier…
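To make the contrast with standalone method-level tasks concrete, here is a small illustrative class of the kind a class-level task might ask for. It is a hypothetical example written for this page, not an item from ClassEval; the class name, method names, and integer-cent pricing are assumptions. The point is that the methods share state and call one another, so a model has to keep them mutually consistent.

```python
from typing import Dict, Tuple


class ShoppingCart:
    """Hypothetical class-level task: methods depend on shared state."""

    def __init__(self) -> None:
        # Maps item name -> (unit price in cents, quantity).
        self.items: Dict[str, Tuple[int, int]] = {}

    def add_item(self, name: str, unit_price_cents: int, quantity: int = 1) -> None:
        """Add an item, accumulating quantity if it is already present."""
        _, existing = self.items.get(name, (unit_price_cents, 0))
        self.items[name] = (unit_price_cents, existing + quantity)

    def remove_item(self, name: str) -> None:
        """Remove an item entirely; do nothing if it is absent."""
        self.items.pop(name, None)

    def total_cents(self) -> int:
        """Sum price * quantity over the shared item table."""
        return sum(price * qty for price, qty in self.items.values())

    def discounted_total_cents(self, percent: int) -> int:
        """Depends on total_cents(), so the two must agree on units."""
        return self.total_cents() * (100 - percent) // 100
```

A method-level benchmark would test something like `total_cents` in isolation; a class-level test exercises the methods together, for example adding items, applying a discount, and removing an item in sequence.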

Code & Models

Datasets

FudanSELab/ClassEval
dataset · 3.9k dl

Taxonomy

Topics: Software Engineering Research · Machine Learning and Data Classification · Web Data Mining and Analysis

Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Byte Pair Encoding · Multi-Head Attention · Weight Decay · Position-Wise Feed-Forward Layer · Softmax