Investigating the Impact of Metric Aggregation Techniques on Defect Prediction
Rawad Abou Assi

TL;DR
This study evaluates how different metric aggregation techniques affect defect prediction models, finding that simple summation generally outperforms more complex methods and that combining multiple techniques offers no significant benefit.
Contribution
The paper systematically compares nine aggregation techniques for code metrics and demonstrates that summation remains the most effective method for defect prediction.
Findings
Summation generally yields better defect prediction performance than the other aggregation techniques studied.
Complex aggregation methods do not significantly outperform simple summation.
Combining multiple aggregation techniques does not improve prediction accuracy.
Abstract
Code metrics collected at the method level are often aggregated using summation to capture system properties at higher levels (e.g., file- or package-level). Since defect data is often available at these higher levels, this aggregation allows researchers to build defect prediction models. Recent findings by Landman et al. indicate that aggregation is likely to inflate the correlation between size and complexity metrics. In this paper, we explore the effect of nine aggregation techniques on the correlation between three types of code metrics, namely Lines of Code, McCabe, and Halstead metrics. In addition to summation, we study aggregation techniques that are measures of: (1) central tendency (average and median), (2) dispersion (standard deviation and inter-quartile range), (3) shape (skewness and kurtosis), and (4) income inequality (Theil index and Gini coefficient). Our results show…
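To make the nine aggregation techniques named in the abstract concrete, here is a minimal Python sketch that lifts a list of method-level metric values to a single file-level value per technique. It is an illustration under stated assumptions, not the authors' implementation: the function names (`aggregate`, `gini`, `theil`, `skewness`, `kurtosis`), the use of population (rather than sample) moments, the inclusive quartile method, and the example values are all choices made here.

```python
import math
import statistics


def central_moment(values, k):
    """k-th central moment, using the population (1/n) convention (an assumption)."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness; defined as 0.0 here when all values are equal."""
    m2 = central_moment(values, 2)
    return central_moment(values, 3) / m2 ** 1.5 if m2 > 0 else 0.0


def kurtosis(values):
    """Moment-based (non-excess) kurtosis; defined as 0.0 here when all values are equal."""
    m2 = central_moment(values, 2)
    return central_moment(values, 4) / m2 ** 2 if m2 > 0 else 0.0


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def theil(values):
    """Theil index of non-negative values, treating 0 * ln(0) as 0."""
    if not values:
        return 0.0
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0
    return sum((x / mean) * math.log(x / mean) for x in values if x > 0) / len(values)


def aggregate(values):
    """Aggregate method-level values into nine file-level values.

    Assumes at least two values, because quartiles need two data points.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "sum": sum(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),
        "iqr": q3 - q1,
        "skewness": skewness(values),
        "kurtosis": kurtosis(values),
        "theil": theil(values),
        "gini": gini(values),
    }


# Hypothetical per-method complexity values for two files with the same total.
uniform_file = [5, 5, 5, 5]
skewed_file = [1, 1, 1, 17]

for name, metrics in (("uniform", uniform_file), ("skewed", skewed_file)):
    print(name, aggregate(metrics))
```

Both hypothetical files have the same sum, so summation alone cannot tell them apart, while the dispersion, shape, and inequality measures differ. That is the kind of information the alternative techniques are meant to capture.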
Taxonomy
Topics: Software Engineering Research · Advanced Data Storage Technologies · Algorithms and Data Compression
