Evaluating LLM-Based Test Generation Under Software Evolution

Sabaat Haroon; Mohammad Taha Khan; Muhammad Ali Gulzar

arXiv:2603.23443 · cs.SE · March 25, 2026

TL;DR

This study empirically evaluates how large language models generate unit tests and how those tests respond to code changes, revealing a reliance on superficial cues and difficulty maintaining regression detection as software evolves.

Contribution

It provides a large-scale analysis of LLM-based test generation under program changes, highlighting limitations in semantic understanding and regression awareness.

Findings

01. LLMs achieve high coverage on the original code, but coverage declines after code changes.

02. Over 99% of the tests that fail on changed code still pass on the original code, indicating superficial alignment with the original behavior.

03. Test generation is sensitive to lexical changes, not just semantic modifications.

Abstract

Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully…
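To make the SAC/SPC distinction concrete, here is a minimal Python sketch. It is not the paper's framework: the pricing function, its two variants, and the test are hypothetical names invented for illustration. It assumes an SPC can be as simple as renaming identifiers and an SAC as small as changing one comparison operator.

```python
from typing import Callable, Dict


def price_original(quantity: int, unit_price: float) -> float:
    """Hypothetical original program: 10% discount for 10 or more items."""
    total = quantity * unit_price
    if quantity >= 10:
        total *= 0.9
    return total


def price_spc(n: int, p: float) -> float:
    """Semantic-preserving change (assumed form): identifiers renamed only."""
    amount = n * p
    if n >= 10:
        amount *= 0.9
    return amount


def price_sac(quantity: int, unit_price: float) -> float:
    """Semantic-altering change (assumed form): threshold is now strictly > 10."""
    total = quantity * unit_price
    if quantity > 10:
        total *= 0.9
    return total


def boundary_test(price: Callable[[int, float], float]) -> bool:
    """A test aligned with the ORIGINAL behavior at the discount boundary."""
    return abs(price(10, 2.0) - 18.0) < 1e-9


def run_on_variants() -> Dict[str, bool]:
    """Run the same test against each variant and record pass/fail."""
    variants = {
        "original": price_original,
        "spc": price_spc,
        "sac": price_sac,
    }
    return {name: boundary_test(fn) for name, fn in variants.items()}


if __name__ == "__main__":
    print(run_on_variants())
```

In this sketch the boundary test passes on the original and on the SPC variant but fails on the SAC variant. If a test generated for the SAC variant looked like `boundary_test`, it would match the pattern reported in the findings: it fails on the changed code yet passes on the original.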

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Engineering Techniques and Practices