Towards a Change Taxonomy for Machine Learning Systems
Aaditya Bhatia, Ellis E. Eghan, Manel Grichi, William G. Cavanagh, Zhen Ming (Jack) Jiang, Bram Adams

TL;DR
This paper empirically analyzes contributions to ML research repositories, revealing low contribution rates, types of changes made, and missed opportunities for collaboration, while extending a taxonomy of code changes with ML-specific categories.
Contribution
It introduces a new ML-specific change category and sub-categories to Hindle et al.'s taxonomy, and provides empirical insights into contribution patterns in ML repositories.
Findings
Only 9% of forks make any modifications to their forked copy of the repository.
52% of changes from forks are accepted by parent repositories.
15 new change sub-categories are identified, nine of them ML-specific.
Abstract
Machine Learning (ML) research publications commonly provide open-source implementations on GitHub, allowing their audience to replicate, validate, or even extend machine learning algorithms, data sets, and metadata. However, thus far little is known about the degree of collaboration activity happening on such ML research repositories, in particular regarding (1) the degree to which such repositories receive contributions from forks, (2) the nature of such contributions (i.e., the types of changes), and (3) the nature of changes that are not contributed back by forks, which might represent missed opportunities. In this paper, we empirically study contributions to 1,346 ML research repositories and their 67,369 forks, both quantitatively and qualitatively (by building on Hindle et al.'s seminal taxonomy of code changes). We found that while ML research repositories are heavily forked,…
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Data Storage Technologies · Software Engineering Research
