CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification

Jiacheng Xu; Bo Pang; Jin Qu; Hiroaki Hayashi; Caiming Xiong; Yingbo Zhou

arXiv:2502.08806·cs.SE·February 14, 2025

TL;DR

CLOVER is a comprehensive benchmark designed to evaluate AI models' ability to generate and complete test cases in software testing, emphasizing coverage, long-context understanding, and verification across multiple Python repositories.

Contribution

This paper introduces CLOVER, a novel benchmark with diverse test case tasks and a retrieval-based context construction method, highlighting current model limitations and potential for advancement.

Findings

01. Models perform similarly with short contexts but differ with 16k contexts.

02. GPT-4o and Claude 3.5 effectively leverage relevant snippets.

03. All models score below 35% on complex tasks even with oracle context.

Abstract

Software testing is a critical aspect of software development, yet generating test cases remains a routine task for engineers. This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases under specific conditions. Spanning from simple assertion completions to writing test cases that cover specific code blocks across multiple files, these tasks are based on 12 Python repositories, analyzing 845 problems with context lengths ranging from 4k to 128k tokens. Utilizing code testing frameworks, we propose a method to construct retrieval contexts using coverage information. While models exhibit comparable performance with short contexts, notable differences emerge with 16k contexts. Notably, models like GPT-4o and Claude 3.5 can effectively leverage relevant snippets; however, all models score below 35% on the complex Task III, even with…
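The abstract mentions constructing retrieval contexts from coverage information. The following is only a rough sketch of that general idea, not the authors' method: it assumes a hypothetical mapping from file paths to the line numbers a test executed, and it greedily keeps the most-covered files that fit a token budget. The function name `build_context`, the data shapes, and the whitespace-based token count are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def build_context(
    covered: Dict[str, Set[int]],  # hypothetical: file path -> line numbers a test executed
    sources: Dict[str, str],       # hypothetical: file path -> file contents
    budget_tokens: int,
) -> List[str]:
    """Greedily pick the most-covered files that still fit a rough token budget."""
    # Rank files by how many of their lines were executed (ties broken by path).
    ranked = sorted(covered, key=lambda path: (-len(covered[path]), path))
    chosen: List[str] = []
    used = 0
    for path in ranked:
        # Whitespace splitting is a crude stand-in for a real tokenizer.
        cost = len(sources.get(path, "").split())
        if cost == 0 or used + cost > budget_tokens:
            continue  # skip empty/unknown files and files that would exceed the budget
        chosen.append(path)
        used += cost
    return chosen


# Toy data, invented for this example only.
covered = {"pkg/core.py": {1, 2, 3, 4}, "pkg/util.py": {10, 11}, "pkg/io.py": {5}}
sources = {
    "pkg/core.py": "def run(x):\n    return helper(x) + 1\n",
    "pkg/util.py": "def helper(x):\n    return x * 2\n",
    "pkg/io.py": "def load(path):\n    return open(path).read()\n",
}
selected = build_context(covered, sources, budget_tokens=12)
print(selected)
```

In practice the coverage mapping would come from a coverage tool run against existing tests, and the budget would be measured with the target model's tokenizer; both are left abstract here.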

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Reliability and Analysis Research · Software System Performance and Reliability