Quality Assessment of Python Tests Generated by Large Language Models
Victor Alves, Carla Bezerra, Ivan Machado, Larissa Rocha, Tássio Virgínio, Publio Silva

TL;DR
This study evaluates the quality of Python test code generated by large language models, revealing common errors, test smells, and the influence of prompt context on test suite reliability.
Contribution
It provides a comparative analysis of LLMs' test generation quality, highlighting error patterns, test smells, and the impact of prompt design on test suite quality.
Findings
Most generated test suites contained errors or test smells.
GPT-4o produced the fewest errors among LLMs.
Prompt context significantly affects test quality and error rates.
Abstract
The manual generation of test scripts is a time-intensive, costly, and error-prone process, indicating the value of automated solutions. Large Language Models (LLMs) have shown great promise in this domain, leveraging their extensive knowledge to produce test code more efficiently. This study investigates the quality of Python test code generated by three LLMs: GPT-4o, Amazon Q, and Llama 3.3. We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C). Our analysis includes the identification of errors and test smells, with a focus on correlating these issues to inadequate design patterns. Our findings reveal that most test suites generated by the LLMs contained at least one error or test smell. Assertion errors were the most common, comprising 64% of all identified errors, while the test smell Lack of Cohesion…
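As a rough illustration of what an assertion error in a generated test can look like, the sketch below runs two hand-written checks against a small function and tallies the outcome with the standard `unittest` module. Everything here is hypothetical: `apply_discount`, the two `generated_case_*` functions, and the wrong expected value are invented for the example and are not taken from the study, whose own tooling and error classification are not shown in this summary.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical function under test (not from the paper)."""
    return round(price * (1 - percent / 100), 2)


def generated_case_correct_oracle():
    # The expected value matches the function's real behaviour.
    assert apply_discount(200.0, 10) == 180.0


def generated_case_wrong_oracle():
    # The expected value is wrong, as if the generator guessed the oracle.
    # Running this check raises AssertionError.
    assert apply_discount(200.0, 10) == 190.0


def run_generated_suite():
    """Run both checks and return counts of runs, failures, and errors."""
    suite = unittest.TestSuite(
        [
            unittest.FunctionTestCase(generated_case_correct_oracle),
            unittest.FunctionTestCase(generated_case_wrong_oracle),
        ]
    )
    result = unittest.TestResult()
    suite.run(result)
    return {
        "run": result.testsRun,
        # unittest records AssertionError under `failures`;
        # any other exception would be recorded under `errors`.
        "failures": len(result.failures),
        "errors": len(result.errors),
    }


summary = run_generated_suite()
print(summary)
```

Note that `unittest` files an `AssertionError` under `failures` rather than `errors`, so a tally like this one would need to map those categories onto whatever error taxonomy an evaluation actually uses.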
Taxonomy
Topics: Computational Physics and Python Applications · Natural Language Processing Techniques
