Empirical Analysis on CI/CD Pipeline Evolution in Machine Learning Projects
Dhia Elhaq Rzig, Alaa Houerbi, Rahul Ghanshyam Chavan, Foyzul Hassan

TL;DR
This paper provides the first empirical analysis of how CI/CD configurations evolve in machine learning projects, revealing common change patterns, developer expertise influences, and areas for improving CI/CD practices in ML software development.
Contribution
It introduces a taxonomy of CI/CD and ML component co-changes, a clustering tool for change patterns, and insights into developer expertise and common bad practices in ML CI/CD configurations.
Findings
61.8% of commits involve build policy changes
Frequent use of deprecated settings and dependencies
Experienced developers more likely to modify CI/CD configurations
Abstract
The growing popularity of machine learning (ML) and the integration of ML components with other software artifacts have led to the use of continuous integration and delivery (CI/CD) tools, such as Travis CI and GitHub Actions, that enable faster integration and testing for ML projects. Such CI/CD configurations and services require synchronization during the life cycle of the projects. Several works have discussed how CI/CD configurations and services change during their usage in traditional software systems. However, there is very limited knowledge of how CI/CD configurations and services change in ML projects. To fill this knowledge gap, this work presents the first empirical analysis of how CI/CD configuration evolves for ML software systems. We manually analyzed 343 commits collected from 508 open-source ML projects to identify common CI/CD configuration change categories in ML projects…
Taxonomy
Topics: Evolutionary Algorithms and Applications · Machine Learning and Data Classification
