Data Preparation for Deep Learning based Code Smell Detection: A Systematic Literature Review
Fengji Zhang, Zexian Zhang, Jacky Wai Keung, Xiangru Tang, Zhen Yang, Xiao Yu, Wenhua Hu

TL;DR
This systematic review examines data preparation techniques for deep learning-based code smell detection, highlighting challenges, solutions, and best practices to improve dataset quality and effectiveness.
Contribution
It provides a comprehensive analysis of data preparation processes and offers practical recommendations for enhancing data quality in DL-based CSD methods.
Findings
Identified 36 relevant papers on DL-based CSD
Summarized seven key challenges and solutions in data preparation
Emphasized importance of data diversity, standardization, and accessibility
Abstract
Code Smell Detection (CSD) plays a crucial role in improving software quality and maintainability. Deep Learning (DL) techniques have emerged as a promising approach for CSD due to their superior performance. However, the effectiveness of DL-based CSD methods relies heavily on the quality of the training data. Despite its importance, little attention has been paid to analyzing the data preparation process. This systematic literature review analyzes the data preparation techniques used in DL-based CSD methods. We identify 36 relevant papers published by December 2023 and provide a thorough analysis of the critical considerations in constructing CSD datasets, including data requirements, collection, labeling, and cleaning. We also summarize seven primary challenges and corresponding solutions in the literature. Finally, we offer actionable recommendations for preparing and accessing…
Taxonomy
Topics: Web Application Security Vulnerabilities · Software Engineering Research · Software Reliability and Analysis Research
