Will It Break in Production? Metric-Driven Prediction of Residual Defects in Python Systems
Giuseppe De Rosa, Pietro Liguori

TL;DR
This study evaluates ML models for predicting residual (post-release) faults in Python systems. Supervised metric-based models outperform LLMs and unsupervised methods, and process metrics are among the most predictive features.
Contribution
The paper shows that supervised metric-based models outperform LLMs for fault prediction in Python and identifies process metrics as key predictive features.
Findings
Supervised metric-based models (RandomForest, XGBoost, CatBoost) achieve a recall of 0.85-0.9 in cross-project fault prediction.
Process metrics such as age, churn, and developer activity are highly predictive.
Metrics and code embeddings provide complementary information in fault detection.
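To make the recall figure concrete, here is a minimal sketch of how recall and false negatives relate, together with one plausible way to split data for cross-project evaluation. The summary does not describe the paper's actual evaluation code, so the leave-one-project-out scheme, the function names, and the toy labels below are all assumptions made for illustration.

```python
from collections import defaultdict


def recall_and_false_negatives(y_true, y_pred):
    """Return (recall, false_negatives) for the positive class (1 = residual fault)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return recall, fn


def leave_one_project_out(projects):
    """Yield (held_out, train_idx, test_idx), holding out one project per fold.

    Assumed splitting scheme: the model never sees the held-out project
    during training, which is one common reading of "cross-project".
    """
    by_project = defaultdict(list)
    for i, name in enumerate(projects):
        by_project[name].append(i)
    for held_out, test_idx in by_project.items():
        train_idx = [i for i in range(len(projects)) if projects[i] != held_out]
        yield held_out, train_idx, test_idx


# Toy data, invented for illustration only (not taken from the paper).
projects = ["a", "a", "b", "b", "c", "c"]
y_true = [1, 0, 1, 1, 0, 1]  # 1 = residual fault
y_pred = [1, 0, 1, 0, 0, 1]  # hypothetical model output

recall, fn = recall_and_false_negatives(y_true, y_pred)
folds = list(leave_one_project_out(projects))
```

Because recall is TP / (TP + FN), every missed residual fault lowers it directly, which is why a higher recall and a drop in false negatives go together.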
Abstract
Python's dynamic nature complicates testing and increases the chance that some defects evade detection, which makes effective fault prediction essential. We examine whether post-release faults can be predicted using modern ML and DL. Using a balanced dataset of over 4,000 labeled faults with 83 product, process, statistical, and Python-specific metrics plus normalized code representations, we conduct cross-project experiments. LLMs and unsupervised models fail to distinguish residual from non-residual faults, while supervised metric-based models (RandomForest, XGBoost, CatBoost) perform far better, achieving a recall of 0.85-0.9 and cutting false negatives by an order of magnitude. Process metrics, especially age, churn, and developer activity, alongside class and file size, consistently prove most predictive. Notably, Principal Component Analysis shows that metrics and code…
