Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

Zheyuan Yang; Zexi Kuang; Xue Xia; Yilun Zhao

arXiv:2506.12278 · cs.SE · June 17, 2025

TL;DR

This paper introduces TestCase-Eval, a comprehensive benchmark for evaluating how effectively large language models generate test cases for algorithm problems, focusing on fault coverage and exposure.

Contribution

It presents a new benchmark with a systematic evaluation framework for assessing LLMs' ability to generate high-quality test cases for algorithm problems.

Findings

01. LLMs vary significantly in fault coverage and exposure capabilities.

02. Proprietary LLMs outperform open-source models in test case generation.

03. The benchmark reveals specific strengths and limitations of current LLMs.

Abstract

We introduce TestCase-Eval, a new benchmark for the systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes; and (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
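To make the two tasks concrete, here is a minimal sketch of how they might be scored. The abstract does not give the paper's exact metric definitions, so everything below is an assumption for illustration: solutions are modeled as plain Python functions, a test "exposes" a solution when its output differs from a reference solution's output, and the names `exposes` and `fault_coverage` are hypothetical.

```python
from typing import Callable, Sequence

# Assumption: a "solution" is modeled as a function from one integer input
# to one integer output. Real Codeforces solutions read stdin and write stdout.
Solution = Callable[[int], int]


def exposes(test_input: int, reference: Solution, candidate: Solution) -> bool:
    """Fault Exposure (sketch): does this single input reveal the candidate's bug?

    The candidate counts as exposed if it crashes or disagrees with the reference.
    """
    expected = reference(test_input)
    try:
        return candidate(test_input) != expected
    except Exception:
        return True


def fault_coverage(
    tests: Sequence[int], reference: Solution, incorrect: Sequence[Solution]
) -> float:
    """Fault Coverage (sketch): fraction of incorrect solutions that fail
    on at least one input in the generated test set."""
    if not incorrect:
        return 0.0
    caught = sum(
        any(exposes(t, reference, sol) for t in tests) for sol in incorrect
    )
    return caught / len(incorrect)


# Toy problem (hypothetical): compute the absolute value of an integer.
reference = abs
buggy_negative = lambda x: x                          # wrong for negative inputs
buggy_zero = lambda x: abs(x) if x != 0 else 1        # wrong only for zero
buggy_large = lambda x: abs(x) % 1001                 # wrong only when |x| >= 1001

generated_tests = [1, -2, 0]
coverage = fault_coverage(
    generated_tests, reference, [buggy_negative, buggy_zero, buggy_large]
)
print(f"fault coverage: {coverage:.2f}")
print("5000 exposes buggy_large:", exposes(5000, reference, buggy_large))
```

In this toy setup the small test set catches the negative-input and zero-input bugs but misses the large-input one, which is the gap a tailored Fault Exposure input such as 5000 would close.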

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Machine Learning and Data Classification