Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study

E. G. Santana Jr; Jander Pereira Santos Junior; Erlon P. Almeida; Iftekhar Ahmed; Paulo Anselmo da Mota Silveira Neto; and Eduardo Santana de Almeida

arXiv:2506.07594·cs.SE·June 10, 2025

TL;DR

This study evaluates how well three Large Language Models (GPT-4-Turbo, LLaMA 3 70B, and Gemini-1.5 Pro) detect and refactor test smells in Python and Java test code. Gemini-1.5 Pro had the highest detection accuracy and was the only model that improved test coverage, which suggests LLMs have potential for automated test maintenance.

Contribution

The study provides the first comprehensive empirical assessment of LLMs for both detecting and refactoring test smells in more than one language (Python and Java).

Findings

01  Gemini achieved the highest detection accuracy (74.35% Python, 80.32% Java).

02  LLaMA showed the lowest detection accuracy of the three models.

03  Gemini improved test coverage, whereas GPT-4 and LLaMA often reduced it.

Abstract

Test smells indicate poor development practices in test code, reducing maintainability and reliability. While developers often struggle to prevent or refactor these issues, existing tools focus primarily on detection rather than automated refactoring. Large Language Models (LLMs) have shown strong potential in code understanding and transformation, but their ability to both identify and refactor test smells remains underexplored. We evaluated GPT-4-Turbo, LLaMA 3 70B, and Gemini-1.5 Pro on Python and Java test suites, using PyNose and TsDetect for initial smell detection, followed by LLM-driven refactoring. Gemini achieved the highest detection accuracy (74.35% Python, 80.32% Java), while LLaMA was lowest. All models could refactor smells, but effectiveness varied, sometimes introducing new smells. Gemini also improved test coverage, unlike GPT-4 and LLaMA, which often reduced it.…
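
For readers unfamiliar with the term, the sketch below illustrates one commonly catalogued test smell, Assertion Roulette (several unexplained assertions in a single test), alongside one possible refactoring. It is not taken from the paper or its replication package: the function `parse_version` and both test classes are hypothetical names invented for this illustration, and the refactoring shown is only one of several reasonable options.

```python
import unittest


def parse_version(text):
    """Hypothetical function under test: split 'major.minor.patch' into ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


class SmellyVersionTest(unittest.TestCase):
    # Assertion Roulette: several assertions with no explanatory message,
    # so a failure report does not say which expectation was violated.
    def test_parse(self):
        result = parse_version("1.2.3")
        self.assertEqual(result[0], 1)
        self.assertEqual(result[1], 2)
        self.assertEqual(result[2], 3)


class RefactoredVersionTest(unittest.TestCase):
    # One possible refactoring (an assumption, not the paper's output):
    # name the unpacked values and attach a message to every assertion.
    def test_parse_returns_each_component(self):
        major, minor, patch = parse_version("1.2.3")
        self.assertEqual(major, 1, "major component should be 1")
        self.assertEqual(minor, 2, "minor component should be 2")
        self.assertEqual(patch, 3, "patch component should be 3")
```

Both test classes exercise the same behaviour, so coverage is unchanged here; the difference lies in how readable a failure would be.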

Code & Models

Repositories

ts-group-icse26/testsmells.llms.study-replication.package-icse26
Official

Taxonomy

Topics
Software Testing and Debugging Techniques · Software Engineering Research · Software Engineering Techniques and Practices