Measuring the Influence of Incorrect Code on Test Generation

Dong Huang; Jie M. Zhang; Mark Harman; Mingzhe Du; Heming Cui

arXiv:2409.09464 · cs.SE · March 31, 2025 · 2 cites

TL;DR

This study empirically measures how the correctness of the code under test affects LLM-based test generation. Compared with incorrect code under test, prompting with correct code improves test accuracy by 57%, code coverage by 12%, and bug detection by 24%.

Contribution

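The paper presents what the authors describe as the first measurement of how code correctness affects LLM-based test generation, covering 11 language models (5 open-source, 6 closed-source), 3 widely used benchmark datasets, and 41 repo-level real-world examples.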

Findings

01  Test accuracy improves by 57% when LLMs are prompted with correct code

02  Code coverage improves by 12% with correct code

03  Bug detection improves by 24% with correct code

Abstract

It is natural to suppose that a Large Language Model is more likely to generate correct test cases when prompted with correct code under test than when prompted with incorrect code under test. However, the size of this effect has never previously been measured, despite its obvious importance for both practicing software engineers and researchers. To answer the question, we conducted a comprehensive empirical study on 5 open-source and 6 closed-source language models, with 3 widely used benchmark datasets together with 41 repo-level real-world examples from two different real-world datasets. Our results reveal that, when compared to incorrect code under test, LLMs prompted with correct code achieve improvements in test accuracy, code coverage, and bug detection of 57%, 12%, and 24% respectively. We further show that these scientific conclusions carry over from the three benchmark datasets to…
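The comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: test accuracy is taken to be the fraction of generated tests whose expected output agrees with a known-correct reference implementation, and the improvement is computed as a relative change. The abstract does not state the metric definitions or whether the reported percentages are relative or absolute. The names `accuracy_of_tests`, `relative_improvement`, and `reference_add`, and the toy test lists, are hypothetical.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical model of an LLM-generated test: (input arguments, expected output).
GeneratedTest = Tuple[tuple, object]


def accuracy_of_tests(tests: Iterable[GeneratedTest], reference: Callable) -> float:
    """Fraction of tests whose expected output matches the reference implementation.

    Assumed definition for illustration; the paper's own metric may differ.
    """
    tests = list(tests)
    if not tests:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if reference(*args) == expected:
                passed += 1
        except Exception:
            # A test that crashes the reference is counted as incorrect.
            pass
    return passed / len(tests)


def relative_improvement(with_correct: float, with_incorrect: float) -> float:
    """Relative change of a metric when moving from incorrect to correct code."""
    if with_incorrect == 0:
        raise ValueError("baseline metric is zero; relative change is undefined")
    return (with_correct - with_incorrect) / with_incorrect


def reference_add(a: int, b: int) -> int:
    """Known-correct implementation used as ground truth in this toy example."""
    return a + b


# Toy data only: tests an LLM might write after seeing correct code (a + b)...
tests_from_correct = [((1, 2), 3), ((0, 0), 0), ((2, 2), 4), ((5, -1), 4)]
# ...and after seeing buggy code (a - b), where some expectations copy the bug.
tests_from_incorrect = [((1, 2), -1), ((0, 0), 0), ((2, 2), 0), ((5, -1), 4)]

acc_correct = accuracy_of_tests(tests_from_correct, reference_add)
acc_incorrect = accuracy_of_tests(tests_from_incorrect, reference_add)
gain = relative_improvement(acc_correct, acc_incorrect)
```

In this toy setup, tests written against the buggy code encode the bug in their expected values, so they fail against the reference. That is one plausible mechanism for the accuracy gap; the numbers here are illustrative and unrelated to the paper's results.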

Code & Models

Repositories

huangd1999/EmpiricalStudyofTestGeneration (Official)

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Reliability and Analysis Research · Real-time simulation and control systems