Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
Zheyuan Yang, Zexi Kuang, Xue Xia, Yilun Zhao

TL;DR
This paper introduces TestCase-Eval, a comprehensive benchmark for evaluating how effectively large language models generate test cases for algorithm problems, focusing on fault coverage and exposure.
Contribution
It presents a new benchmark with a systematic evaluation framework for assessing LLMs' ability to generate high-quality test cases for algorithm problems.
Findings
LLMs vary significantly in fault coverage and exposure capabilities.
Proprietary LLMs outperform open-source models in test case generation.
The benchmark reveals specific strengths and limitations of current LLMs.
Abstract
We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes; and (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
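The two tasks named in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation harness, and the exact metric definitions are not given in this summary. The sketch assumes that a test input "exposes" an incorrect solution when its output differs from a reference solution's output (or it crashes), and that fault coverage is the fraction of incorrect solutions exposed by at least one test in a set. The toy problem (maximum of a list) and all function names are hypothetical. Solutions are modeled as plain Python callables, whereas real Codeforces submissions are programs reading standard input.

```python
from typing import Callable, List, Sequence

# A "solution" is modeled as a function from an input list to an integer answer.
Solution = Callable[[List[int]], int]


def reference(xs: List[int]) -> int:
    """Correct solution for the toy problem: return the maximum element."""
    return max(xs)


def buggy_nonneg(xs: List[int]) -> int:
    """Incorrect: silently assumes every value is non-negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def buggy_skip_last(xs: List[int]) -> int:
    """Incorrect: ignores the last element when the list has more than one."""
    return max(xs[:-1]) if len(xs) > 1 else xs[0]


def exposes(test_input: Sequence[int], wrong: Solution, ref: Solution) -> bool:
    """Fault exposure (assumed definition): does this single input reveal the bug?"""
    expected = ref(list(test_input))
    try:
        return wrong(list(test_input)) != expected
    except Exception:
        # A crash on a valid input is treated as a failure as well.
        return True


def fault_coverage(
    tests: Sequence[Sequence[int]],
    wrong_solutions: Sequence[Solution],
    ref: Solution,
) -> float:
    """Fault coverage (assumed definition): share of incorrect solutions
    that fail on at least one test in the set."""
    if not wrong_solutions:
        return 0.0
    caught = sum(
        1 for w in wrong_solutions if any(exposes(t, w, ref) for t in tests)
    )
    return caught / len(wrong_solutions)


if __name__ == "__main__":
    wrong = [buggy_nonneg, buggy_skip_last]
    print(fault_coverage([[1, 2, 3]], wrong, reference))
    print(fault_coverage([[1, 2, 3], [-5, -2]], wrong, reference))
    print(exposes([-1], buggy_nonneg, reference))
```

In this toy setting, the single test `[1, 2, 3]` catches only the solution that skips the last element, while adding an all-negative input also catches the non-negative assumption. This mirrors why a test set that probes diverse input scenarios tends to cover more failure modes.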
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Machine Learning and Data Classification
