SACS: A Code Smell Dataset using Semi-automatic Generation Approach
Hanyu Zhang, Tomoji Kishi

TL;DR
This paper presents SACS, a high-quality, semi-automatically generated dataset of over 30,000 labeled code smell samples across three categories, aimed at advancing machine learning-based code smell detection.
Contribution
The study introduces a semi-automatic approach combining automatic rules and manual review to create a large, reliable code smell dataset, addressing data quality issues in existing datasets.
Findings
Created an open-source dataset with over 10,000 samples per code smell category.
Applied structured review guidelines and an annotation tool for manual validation.
Facilitates future research in code smell detection and automated refactoring.
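The semi-automatic idea summarized above (automatic rules propose labels, manual review confirms them) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and does not come from the paper: the line-count rule, the `LONG_METHOD_LINES` threshold, and the names `Sample`, `apply_rule`, `review`, and `build_dataset` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical threshold; the paper's actual rules are not given in this summary.
LONG_METHOD_LINES = 30


@dataclass
class Sample:
    name: str
    source: str
    auto_label: Optional[bool] = None   # proposed by the automatic rule
    final_label: Optional[bool] = None  # set only after manual review


def apply_rule(sample: Sample) -> Sample:
    """Step 1: a cheap automatic rule proposes a candidate label."""
    non_blank = [ln for ln in sample.source.splitlines() if ln.strip()]
    sample.auto_label = len(non_blank) > LONG_METHOD_LINES
    return sample


def review(sample: Sample, reviewer_agrees: bool) -> Sample:
    """Step 2: a human reviewer confirms or overturns the proposed label."""
    if sample.auto_label is None:
        raise ValueError("sample must be auto-labeled before review")
    sample.final_label = (
        sample.auto_label if reviewer_agrees else not sample.auto_label
    )
    return sample


def build_dataset(samples: List[Sample], decisions: Dict[str, bool]) -> List[Sample]:
    """Keep only samples that have a recorded manual review decision."""
    dataset = []
    for sample in samples:
        apply_rule(sample)
        if sample.name in decisions:
            dataset.append(review(sample, decisions[sample.name]))
    return dataset
```

In this sketch, a sample without a review decision never enters the dataset, which reflects the general trade-off the summary describes: the rule supplies scale, and the manual pass supplies label reliability.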
Abstract
Code smells pose a major challenge in software refactoring: they indicate latent design or implementation flaws that may degrade software maintainability and evolution. Over the past decades, research on code smells has received extensive attention, and studies applying machine learning techniques have become especially popular in recent years. However, one of the biggest challenges in applying machine learning techniques is the lack of high-quality code smell datasets. Manually constructing such datasets is extremely labor-intensive, as identifying code smells requires substantial development expertise and considerable time investment. In contrast, automatically generated datasets, while scalable, frequently exhibit reduced label reliability and compromised data quality. To overcome this challenge, in this study we explore a semi-automatic approach to generate a code…
