SACS: A Code Smell Dataset using Semi-automatic Generation Approach

Hanyu Zhang; Tomoji Kishi

arXiv:2602.15342·cs.SE·April 21, 2026

TL;DR

This paper presents SACS, a high-quality, semi-automatically generated dataset of over 30,000 labeled code smell samples across three categories, aimed at advancing machine learning-based code smell detection.

Contribution

The study introduces a semi-automatic approach combining automatic rules and manual review to create a large, reliable code smell dataset, addressing data quality issues in existing datasets.
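The combination of automatic rules and manual review can be pictured with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' pipeline: the `Sample` type, the `lines_of_code` metric, the threshold of 50, and the policy of discarding samples the reviewer rejects are all hypothetical, and the summary above does not specify the paper's actual rules or review guidelines.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    name: str
    lines_of_code: int                  # hypothetical metric, for illustration only
    auto_label: Optional[bool] = None   # set by the automatic rule
    final_label: Optional[bool] = None  # set only after a reviewer confirms


def auto_rule(sample: Sample, threshold: int = 50) -> bool:
    """Hypothetical automatic rule: flag a sample as smelly above a size threshold."""
    return sample.lines_of_code > threshold


def semi_automatic_label(
    samples: List[Sample],
    rule: Callable[[Sample], bool],
    review: Callable[[Sample], bool],
) -> List[Sample]:
    """Label automatically, then keep only samples a reviewer confirms.

    `review` stands in for the manual step: it returns True when the
    reviewer agrees with the automatic label. Rejected samples are dropped
    here (an assumed policy), so every retained label has been checked.
    """
    confirmed = []
    for s in samples:
        s.auto_label = rule(s)
        if review(s):
            s.final_label = s.auto_label
            confirmed.append(s)
    return confirmed
```

In this sketch the automatic rule supplies scale and the review callback supplies reliability. A real pipeline would replace `review` with an annotation tool and written guidelines, as the findings below describe.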

Findings

01. Created an open-source dataset with over 10,000 samples per code smell category.

02. Applied structured review guidelines and an annotation tool for manual validation.

03. Facilitates future research in code smell detection and automated refactoring.

Abstract

Code smells pose a major challenge in software refactoring: they indicate latent design or implementation flaws that may degrade software maintainability and evolution. Over the past decades, research on code smells has received extensive attention, and studies that apply machine learning techniques have become especially popular in recent years. However, one of the biggest challenges in applying machine learning techniques is the lack of high-quality code smell datasets. Manually constructing such datasets is extremely labor-intensive, as identifying code smells requires substantial development expertise and considerable time investment. In contrast, automatically generated datasets, while scalable, frequently exhibit reduced label reliability and compromised data quality. To overcome this challenge, in this study, we explore a semi-automatic approach to generate a code…
