Defect Prediction with Content-based Features
Hung Viet Pham, Tung Thanh Nguyen

TL;DR
This paper investigates the use of source code content features like words and topics for defect prediction, demonstrating they outperform traditional complexity metrics and can be enhanced with feature selection.
Contribution
It introduces content-based features for defect prediction and empirically shows their superior predictive power over traditional metrics.
Findings
Content features outperform complexity metrics in defect prediction.
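Feature selection, reduction, and combination further improve prediction performance.
The content-based approach offers an alternative to complexity-based defect prediction: it models what a file is about rather than how complex it is.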
Abstract
Traditional defect prediction approaches often use metrics that measure the complexity of the design or implementing code of a software system, such as the number of lines of code in a source file. In this paper, we explore a different approach based on content of source code. Our key assumption is that source code of a software system contains information about its technical aspects and those aspects might have different levels of defect-proneness. Thus, content-based features such as words, topics, data types, and package names extracted from a source code file could be used to predict its defects. We have performed an extensive empirical evaluation and found that: i) such content-based features have higher predictive power than code complexity metrics and ii) the use of feature selection, reduction, and combination further improves the prediction performance.
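To make the idea of content-based features concrete, here is a minimal sketch of how words and package names might be pulled from a Java source file and then filtered. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`split_identifier`, `package_names`, `content_features`, `select_features`) are hypothetical, topics and data types are omitted, and the selection score (difference in document frequency between defective and clean files) is a simple stand-in for whatever feature selection method the paper actually uses.

```python
import re
from collections import Counter

# All names below are hypothetical; the abstract does not describe the
# extraction pipeline, so this is only one plausible way to do it.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
CAMEL_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")
IMPORT = re.compile(r"^\s*import\s+([\w.]+)", re.MULTILINE)


def split_identifier(identifier):
    """Split parseHttpRequest or MAX_SIZE into lowercase words."""
    words = []
    for chunk in identifier.split("_"):
        words.extend(part.lower() for part in CAMEL_PART.findall(chunk))
    return words


def package_names(source):
    """Return the package of each Java import statement in the source."""
    packages = []
    for name in IMPORT.findall(source):
        if name.endswith("."):
            # Wildcard import such as java.util.* leaves a trailing dot.
            package = name.rstrip(".")
        else:
            # Drop the class name: java.util.List -> java.util
            package = name.rpartition(".")[0]
        if package:  # skips matches with no dotted name (e.g. "static")
            packages.append(package)
    return packages


def content_features(source):
    """Bag of features: word counts plus 'pkg:'-prefixed package names."""
    features = Counter()
    for identifier in IDENTIFIER.findall(source):
        features.update(split_identifier(identifier))
    for package in package_names(source):
        features["pkg:" + package] += 1
    return features


def select_features(feature_bags, labels, k):
    """Keep the k features whose document frequency differs most
    between defective (label 1) and clean (label 0) files.
    This scoring rule is an assumption chosen for simplicity."""
    defective, clean = Counter(), Counter()
    for bag, is_defective in zip(feature_bags, labels):
        (defective if is_defective else clean).update(bag.keys())
    n_defective = max(1, sum(labels))
    n_clean = max(1, len(labels) - sum(labels))
    score = {
        feature: abs(defective[feature] / n_defective - clean[feature] / n_clean)
        for feature in set(defective) | set(clean)
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(score, key=lambda f: (-score[f], f))[:k]
```

In use, each source file would be passed through `content_features`, the resulting bags and their defect labels through `select_features`, and the surviving features fed to any standard classifier. Language keywords such as `class` appear in nearly every file, so a frequency-difference score like this one tends to rank them low without an explicit stop list.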
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Manufacturing Process and Optimization
