TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Wenhan Wang; Chenyuan Yang; Zhijie Wang; Yuheng Huang; Zhaoyang Chu; Da Song; Lingming Zhang; An Ran Chen; Lei Ma

arXiv:2406.04531 · cs.SE · February 4, 2025 · 3 cites

TL;DR

This paper introduces TESTEVAL, a benchmark of 210 Python programs from LeetCode for evaluating how well large language models generate test cases. Its results point to current limitations in the models' understanding of program logic.

Contribution

We created a new benchmark dataset and evaluation framework for LLMs in test case generation, enabling fair comparison across models.

Findings

01

LLMs struggle with targeted coverage tasks (specific lines, branches, or paths)

02

Current models have limited understanding of program logic

03

Benchmark datasets are now publicly available

Abstract

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities. In this paper, we propose TESTEVAL, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate sixteen popular LLMs, including both commercial and open-source ones, on TESTEVAL. We find that generating test cases to…
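The three task types named in the abstract can be illustrated with a small sketch. This is not the benchmark's evaluation harness: the program under test (`classify`) and the helper names (`trace_lines`, `overall_line_coverage`, `hits_target`, `follows_path`) are hypothetical, and the sketch measures line coverage only, using Python's standard `sys.settrace` hook. It assumes a path can be approximated by the ordered sequence of executed lines, which may differ from how the paper defines paths.

```python
import sys

# Hypothetical program under test (not taken from the benchmark).
def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"

def trace_lines(func, *args):
    """Run func(*args) and return the executed line offsets, in order.

    Offsets are relative to the line holding the `def` statement.
    """
    code = func.__code__
    path = []

    def tracer(frame, event, arg):
        # Record only 'line' events that belong to the function under test.
        if event == "line" and frame.f_code is code:
            path.append(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return path

def overall_line_coverage(func, inputs, body_offsets):
    """Fraction of body lines executed by at least one input (overall coverage)."""
    covered = set()
    for args in inputs:
        covered.update(trace_lines(func, *args))
    return len(covered & set(body_offsets)) / len(body_offsets)

def hits_target(func, args, target_offset):
    """True if this single input executes the target line (targeted line coverage)."""
    return target_offset in trace_lines(func, *args)

def follows_path(func, args, target_path):
    """True if this input executes exactly the given line sequence (targeted path coverage)."""
    return trace_lines(func, *args) == list(target_path)
```

For example, `hits_target(classify, (0,), 4)` asks whether the input `0` reaches the `return "zero"` line. The targeted tasks are stricter than overall coverage because one specific input must reach a specified location, rather than a set of inputs collectively touching as many lines as possible.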

Code & Models

Repositories

llm4softwaretesting/testeval
Official

Videos

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Taxonomy

Topics: Software Testing and Debugging Techniques · Software System Performance and Reliability · Software Engineering Research