Pitfalls and Guidelines for Using Time-Based Git Data
Samuel W. Flint, Jigyasa Chauhan, Robert Dyer

TL;DR
This paper surveys the use of time-based data in software engineering research, quantifies its prevalence, identifies common sources of data dirtiness, and offers best practices for researchers to improve data quality.
Contribution
It provides the first comprehensive analysis of time-based data usage in MSR papers and offers practical guidelines to handle data quality issues in such research.
Findings
At least 290 of the 754 MSR papers published 2004–2021 (38%) use time-based data
Most data comes from Git commits on GitHub
Identified multiple sources of dirty timestamp data
Abstract
Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data, however, time-based data is often dirty. To date, no studies quantify how frequently the software engineering research community uses such data, investigate the sources of dirty data, or quantify how often it is dirty. Depending on the research task and method used, including such dirty data could affect the research results. This paper presents an extended survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Of the 754 technical track and data papers published in MSR 2004–2021, at least 290 (38%) utilized time-based data. We also observed that most time-based data used in research papers comes in the form of Git…
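To make the notion of dirty commit timestamps concrete, here is a minimal Python sketch of a sanity filter. The `Commit` record, the `suspicious_commits` function, and the three checks (timestamp at or before the Unix epoch, timestamp in the future, timestamp earlier than a parent's) are illustrative assumptions, not the sources of dirtiness or the guidelines reported in the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Commit:
    """Hypothetical minimal commit record; not a structure from the paper."""
    sha: str
    timestamp: int       # commit time, in seconds since the Unix epoch
    parents: tuple = ()  # SHAs of parent commits


def suspicious_commits(commits, now):
    """Return {sha: [reasons]} for commits with implausible timestamps.

    The three checks below are illustrative assumptions only.
    """
    by_sha = {c.sha: c for c in commits}
    cutoff = int(now.timestamp())
    flagged = {}
    for c in commits:
        reasons = []
        if c.timestamp <= 0:
            reasons.append("at or before Unix epoch")
        if c.timestamp > cutoff:
            reasons.append("in the future")
        for p in c.parents:
            parent = by_sha.get(p)
            # A child dated before its parent suggests a wrong clock
            # or a rewritten history.
            if parent is not None and c.timestamp < parent.timestamp:
                reasons.append("earlier than parent " + p)
        if reasons:
            flagged[c.sha] = reasons
    return flagged


# Made-up history for illustration; the reference time is fixed so the
# result does not depend on when the sketch is run.
now = datetime(2022, 1, 1, tzinfo=timezone.utc)
history = [
    Commit("a1", 0),
    Commit("b2", 1_600_000_000, ("a1",)),
    Commit("c3", 1_500_000_000, ("b2",)),
    Commit("d4", 4_102_444_800, ("c3",)),
]
result = suspicious_commits(history, now)
```

With this made-up history, `a1`, `c3`, and `d4` are flagged and `b2` is not. Whether a flagged commit should be dropped, corrected, or kept depends on the research task, so the sketch only reports reasons and leaves the decision to the caller.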
Taxonomy
Topics: Software System Performance and Reliability · Software Engineering Research · Software Engineering Techniques and Practices
