ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, Yiling Lou

TL;DR
This paper introduces ClassEval, a manually crafted benchmark for evaluating large language models on class-level Python code generation, revealing significant performance gaps compared to method-level tasks and analyzing model capabilities and strategies.
Contribution
The paper presents the first class-level code generation benchmark, ClassEval, and provides a comprehensive study of 11 state-of-the-art LLMs on this challenging task.
Findings
All LLMs perform worse on class-level than method-level code generation.
GPT-4 and GPT-3.5 outperform other models significantly.
Holistic generation works best for GPT-4 and GPT-3.5, while incremental methods suit smaller models.
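The difference between the two strategies in the last finding can be illustrated with a minimal sketch. It assumes that holistic generation means requesting the whole class in one call and that incremental generation means requesting one method per call while feeding earlier output back as context. The prompt wording, the `Generate` callable, and `make_fake_model` are hypothetical placeholders for illustration, not the paper's actual prompts or tooling.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns generated code.
Generate = Callable[[str], str]


def holistic(class_skeleton: str, generate: Generate) -> str:
    """Ask for the whole class in a single request."""
    return generate(f"Complete the following class:\n{class_skeleton}")


def incremental(class_header: str, method_specs: List[str], generate: Generate) -> str:
    """Request one method per call, passing the code generated so far as context."""
    code_so_far = class_header
    for spec in method_specs:
        prompt = f"Class so far:\n{code_so_far}\n\nAdd this method:\n{spec}"
        code_so_far += "\n" + generate(prompt)
    return code_so_far


def make_fake_model(log: List[str]) -> Generate:
    """Build a fake model (assumption: for demonstration only) that records each prompt."""
    def fake(prompt: str) -> str:
        log.append(prompt)
        return f"    # generated chunk {len(log)}"
    return fake


if __name__ == "__main__":
    prompts: List[str] = []
    model = make_fake_model(prompts)
    print(incremental("class Counter:", ["def add(self)", "def reset(self)"], model))
    print(f"model calls: {len(prompts)}")
```

With the fake model, the incremental run makes one call per method, and each later prompt contains the chunks produced by earlier calls, whereas the holistic run makes a single call.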
Abstract
In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. We first manually construct the first class-level code generation benchmark, ClassEval, consisting of 100 class-level Python code generation tasks and requiring approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Our results yield the following main findings. First, all existing LLMs perform much worse on class-level code generation than on standalone method-level code generation benchmarks such as HumanEval, and method-level coding ability does not equivalently reflect class-level coding ability among LLMs. Second, GPT-4 and GPT-3.5 still show dominant superiority over other LLMs on class-level code generation, and the second-tier…
Taxonomy
Topics: Software Engineering Research · Machine Learning and Data Classification · Web Data Mining and Analysis
Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Byte Pair Encoding · Multi-Head Attention · Weight Decay · Position-Wise Feed-Forward Layer · Softmax
