Data Balancing Improves Self-Admitted Technical Debt Detection
Murali Sridharan, Mika Mantyla, Leevi Rantala, Maelick Claes

TL;DR
This study investigates how different data balancing techniques affect Self-Admitted Technical Debt (SATD) detection performance, and finds that classical machine learning models combined with balancing methods outperform deep learning in within-project scenarios.
Contribution
It provides an empirical comparison of multiple balancing techniques and models for SATD detection, including a new benchmark and a web-based prediction tool.
Findings
Data balancing improves SATD detection performance.
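Classical ML models outperform deep learning in within-project detection: XGBoost with SMOTE improves F1 over the previous cost-sensitive CNN benchmark.
SMOTE (data level) and the ensemble classifiers Random Forest and XGBoost (classifier level) are reasonable choices, depending on the target metric.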
Abstract
A high imbalance exists between technical debt and non-technical debt source code comments. Such imbalance affects Self-Admitted Technical Debt (SATD) detection performance, and existing literature lacks empirical evidence on the choice of balancing technique. In this work, we evaluate the impact of multiple balancing techniques, including Data-level, Classifier-level, and Hybrid, for SATD detection in Within-Project and Cross-Project setups. Our results show that the Data-level balancing technique SMOTE or the Classifier-level Ensemble approaches Random Forest or XGBoost are reasonable choices, depending on whether the goal is to maximize Precision, Recall, F1, or AUC-ROC. We compared our best-performing model with the previous SATD detection benchmark (cost-sensitive Convolutional Neural Network). Interestingly, the top-performing XGBoost with SMOTE sampling improved the Within-project F1…
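To make the data-level balancing idea concrete, the following is a minimal sketch of SMOTE-style oversampling in plain Python. It is not the authors' implementation: the helper names `smote` and `balance`, the toy 2-D feature vectors, and the brute-force neighbour search are all assumptions made for illustration. It also assumes the comments have already been converted to numeric feature vectors. Each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    Hypothetical helper for illustration: brute-force neighbour search,
    numeric features only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(X) - len(minority)) - len(minority)
    new_points = smote(minority, max(n_needed, 0), k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Toy data (assumed): 8 "non-debt" vs 3 "debt" comments as 2-D vectors.
X = [[0.0, 0.1 * i] for i in range(8)] + [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
y = [0] * 8 + [1] * 3
X_bal, y_bal = balance(X, y, k=2)
```

After the call, `X_bal` holds the original 11 points plus 5 synthetic minority points, so both classes have 8 samples. Oversampling of this kind is normally applied to the training split only, and a maintained implementation such as the `SMOTE` class in imbalanced-learn would usually be preferred over a hand-written one.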
