TESTEVAL: Benchmarking Large Language Models for Test Case Generation
Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma

TL;DR
This paper introduces TESTEVAL, a comprehensive benchmark for evaluating large language models' ability to generate test cases for Python programs, revealing current limitations in understanding program logic.
Contribution
We created a new benchmark dataset and evaluation framework for LLMs in test case generation, enabling fair comparison across models.
Findings
LLMs struggle with targeted coverage tasks
Current models have limited understanding of program logic
Benchmark datasets are now publicly available
Abstract
Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities. In this paper, we propose TESTEVAL, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate sixteen popular LLMs, including both commercial and open-source ones, on TESTEVAL. We find that generating test cases to…
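To make the targeted line coverage task concrete, the sketch below checks whether a single test input executes a chosen line of a program under test. It is an illustration only, not TESTEVAL's evaluation harness: the `classify` function, the helper names `executed_lines` and `covers_target`, and the use of line offsets relative to the `def` line are all assumptions introduced here.

```python
import sys


def executed_lines(func, *args):
    """Return the line offsets (relative to the def line) executed inside func."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only trace frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                hit.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        # Restore whatever tracer (if any) was active before.
        sys.settrace(previous)
    return hit


# Hypothetical program under test; offsets are counted from the def line.
def classify(n):           # offset 0
    if n < 0:              # offset 1
        return "negative"  # offset 2
    if n == 0:             # offset 3
        return "zero"      # offset 4
    return "positive"      # offset 5


def covers_target(func, args, target_offset):
    """True if calling func(*args) executes the line at target_offset."""
    return target_offset in executed_lines(func, *args)
```

Under these assumptions, `covers_target(classify, (0,), 4)` holds because only the input `0` reaches the `return "zero"` line, while an input such as `5` does not. A targeted task asks the model to produce such an input for a given line, which requires reasoning about the conditions guarding that line.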
Taxonomy
Topics: Software Testing and Debugging Techniques · Software System Performance and Reliability · Software Engineering Research
