ZeroIn: Characterizing the Data Distributions of Commits in Software Repositories
Kalyan Perumalla (1), Aradhana Soni (1), Rupam Dey (1), Steven Rich (2) ((1) University of Tennessee, (2) Cisco Systems Inc.)

TL;DR
ZeroIn analyzes software repository metadata to identify data distribution patterns, enabling the generation of synthetic datasets and improving machine learning models for commit quality assessment.
Contribution
The paper introduces a comprehensive characterization of software development metadata distributions, facilitating synthetic data generation and enhanced ML-based commit quality prediction.
Findings
Analyzed Stack Overflow data on developers' features and GitHub data on over 452 million repositories with 16 million commits.
Identified the data distributions that best capture trends in repository, commit, and developer metadata.
Provided insights for generating synthetic datasets to improve commit quality classification.
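As a rough illustration of the idea behind these findings, the sketch below fits two candidate distributions to a metadata feature, keeps the one with the lower AIC, and samples a synthetic dataset from it. It is a minimal sketch under stated assumptions, not the authors' method: the candidate families (exponential and log-normal), the AIC selection rule, the function names, and the simulated "observed" values are all hypothetical and do not come from the paper.

```python
import math
import random


def fit_exponential(xs):
    """Maximum-likelihood fit of an exponential distribution (rate = 1 / mean)."""
    rate = len(xs) / sum(xs)
    loglik = sum(math.log(rate) - rate * x for x in xs)
    return {"name": "exponential", "params": (rate,), "loglik": loglik}


def fit_lognormal(xs):
    """Maximum-likelihood fit of a log-normal distribution (mean and std of log x)."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    loglik = sum(
        -math.log(x * sigma * math.sqrt(2 * math.pi))
        - (math.log(x) - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )
    return {"name": "lognormal", "params": (mu, sigma), "loglik": loglik}


def aic(fit):
    """Akaike information criterion: 2k - 2 log L, lower is better."""
    return 2 * len(fit["params"]) - 2 * fit["loglik"]


def best_fit(xs):
    """Return the candidate fit with the lowest AIC."""
    return min((fit_exponential(xs), fit_lognormal(xs)), key=aic)


def sample(fit, n, rng):
    """Draw n synthetic values from the fitted distribution."""
    if fit["name"] == "exponential":
        (rate,) = fit["params"]
        return [rng.expovariate(rate) for _ in range(n)]
    mu, sigma = fit["params"]
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


rng = random.Random(0)
# Hypothetical stand-in for a positive-valued metadata feature
# (for example, lines changed per commit); simulated, not real data.
observed = [rng.lognormvariate(1.0, 0.8) for _ in range(5000)]

fit = best_fit(observed)
synthetic = sample(fit, 1000, rng)
print(fit["name"], fit["params"])
```

In practice the candidate set would be larger and chosen per feature; the point here is only the fit, select, and sample loop that a distribution-based characterization makes possible.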
Abstract
Modern software development is based on a series of rapid incremental changes collaboratively made to large source code repositories by developers with varying experience and expertise levels. The ZeroIn project is aimed at analyzing the metadata of these dynamic phenomena, including the data on repositories, commits, and developers, to rapidly and accurately mark the quality of commits as they arrive at the repositories. In this context, the present article presents a characterization of the software development metadata in terms of distributions of data that best capture the trends in the datasets. Multiple datasets are analyzed for this purpose, including Stack Overflow on developers' features and GitHub data on over 452 million repositories with 16 million commits. This characterization is intended to make it possible to generate multiple synthetic datasets that can be used in…
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Software System Performance and Reliability
