The Technical Debt Dataset
Valentina Lenarduzzi, Nyyti Saarimäki, Davide Taibi

TL;DR
The paper introduces a comprehensive dataset from 33 Java projects, combining static analysis, code smells, refactoring, and fault data to support empirical research on technical debt.
Contribution
It provides a curated, multi-faceted dataset from open-source projects, enabling standardized comparisons and analysis of technical debt indicators.
Findings
Analyzed 78,000 commits across 33 projects.
Detected 1.8 million SonarQube issues and 38,000 code smells.
Identified 28,000 faults and 57,000 refactorings.
Abstract
Technical Debt analysis is increasing in popularity as nowadays researchers and industry are adopting various tools for static code analysis to evaluate the quality of their code. Despite this, empirical studies on software projects are expensive because of the time needed to analyze the projects. In addition, the results are difficult to compare as studies commonly consider different projects. In this work, we propose the Technical Debt Dataset, a curated set of project measurement data from 33 Java projects from the Apache Software Foundation. In the Technical Debt Dataset, we analyzed all commits from separately defined time frames with SonarQube to collect Technical Debt information and with Ptidej to detect code smells. Moreover, we extracted all available commit information from the git logs, the refactoring applied with Refactoring Miner, and fault information reported in the…
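The commit-extraction step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the format string, the `Commit` record, and the function names `run_git_log` and `parse_git_log` are assumptions made for illustration. It relies only on the standard `git log --pretty=format:` placeholders (`%H`, `%an`, `%aI`, `%s`, `%x1f`).

```python
import subprocess
from dataclasses import dataclass
from typing import List

# Assumed layout: hash, author name, ISO 8601 author date, subject,
# separated by the ASCII unit separator (0x1f) so subjects may contain spaces or tabs.
GIT_LOG_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"


@dataclass
class Commit:
    """Hypothetical record for one commit; not the dataset's actual schema."""
    sha: str
    author: str
    date: str
    subject: str


def run_git_log(repo_path: str) -> str:
    """Return the raw `git log` output for the repository at repo_path."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{GIT_LOG_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_git_log(output: str) -> List[Commit]:
    """Parse output produced with GIT_LOG_FORMAT into Commit records."""
    commits = []
    for line in output.split("\n"):
        if not line.strip():
            continue  # skip blank lines
        # Split into at most four fields so the subject is kept whole.
        sha, author, date, subject = line.split("\x1f", 3)
        commits.append(Commit(sha, author, date, subject))
    return commits
```

Calling `parse_git_log(run_git_log("/path/to/repo"))` on a local clone would yield one record per commit, which could then be joined with the output of other tools by commit hash.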
