Quality Assessment of Python Tests Generated by Large Language Models

Victor Alves; Carla Bezerra; Ivan Machado; Larissa Rocha; Tássio Virgínio; Publio Silva

arXiv:2506.14297·cs.SE·June 18, 2025

Open Access

TL;DR

This study evaluates the quality of Python test code generated by large language models, revealing common errors, test smells, and the influence of prompt context on test suite reliability.

Contribution

It provides a comparative analysis of LLMs' test generation quality, highlighting error patterns, test smells, and the impact of prompt design on test suite quality.

Findings

1. Most generated test suites contained at least one error or test smell.

2. GPT-4o produced the fewest errors of the three LLMs evaluated.

3. Prompt context (Text2Code vs. Code2Code) significantly affects test quality and error rates.

Abstract

The manual generation of test scripts is a time-intensive, costly, and error-prone process, indicating the value of automated solutions. Large Language Models (LLMs) have shown great promise in this domain, leveraging their extensive knowledge to produce test code more efficiently. This study investigates the quality of Python test code generated by three LLMs: GPT-4o, Amazon Q, and Llama 3.3. We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C). Our analysis includes the identification of errors and test smells, with a focus on correlating these issues to inadequate design patterns. Our findings reveal that most test suites generated by the LLMs contained at least one error or test smell. Assertion errors were the most common, comprising 64% of all identified errors, while the test smell Lack of Cohesion…

Taxonomy

Topics: Computational Physics and Python Applications · Natural Language Processing Techniques