A Data Source Dependency Analysis Framework for Large Scale Data Science Projects
Laurent Boué, Pratap Kunireddy, Pavle Subotić

TL;DR
This paper introduces an automated framework for mapping data source dependencies in large ML projects, helping MLOps engineers monitor and mitigate data-related issues proactively.
Contribution
It presents a unified static analysis-based system for identifying data source dependencies across various languages, integrated as a REST API for practical deployment.
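To make the idea of static identification concrete, here is a minimal single-language sketch in Python. It is not the authors' implementation: the paper describes a multi-language system served behind a REST API, whereas this only walks a Python syntax tree and reports string literals passed to a small, hypothetical set of reader functions. The names `READER_CALLS` and `find_data_sources`, and the paths in the example, are assumptions made for illustration.

```python
import ast

# Hypothetical list of call names treated as "data readers".
# The rules used by the paper's framework are not specified here.
READER_CALLS = {"read_csv", "read_parquet", "read_sql", "open"}


def find_data_sources(source: str) -> list:
    """Return (call name, literal argument, line number) for each reader call
    whose first positional argument is a string literal."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handles both `pd.read_csv(...)` (Attribute) and `open(...)` (Name).
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name not in READER_CALLS or not node.args:
            continue
        first = node.args[0]
        # Only literal strings are resolved; variables are skipped in this sketch.
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            found.append((name, first.value, node.lineno))
    return sorted(found, key=lambda item: item[2])


# Illustrative input; the paths are made up.
example = '''
import pandas as pd
users = pd.read_csv("s3://bucket/users.csv")
events = pd.read_parquet(path)
with open("config/labels.json") as fh:
    pass
'''

sources = find_data_sources(example)
for name, literal, line in sources:
    print(f"line {line}: {name} -> {literal}")
```

The sketch reports the `read_csv` and `open` calls and skips `read_parquet(path)`, because resolving a variable to its value requires data-flow analysis, which is the kind of problem a full static analysis framework has to address.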
Findings
Implemented and used by Microsoft MLOps engineers
Reliable identification of data sources across multiple languages
Facilitates proactive data dependency management
Abstract
Dependency hell is a well-known pain point in the development of large software projects, and machine learning (ML) code bases are not immune to it. In fact, ML applications suffer from an additional form, namely, "data source dependency hell". This term refers to the central role played by data and its unique quirks, which often lead to unexpected failures of ML models that cannot be explained by code changes. In this paper, we present an automated dependency mapping framework that allows MLOps engineers to monitor the whole dependency map of their models in a fast-paced engineering environment and thus mitigate ahead of time the consequences of any data source changes (e.g., re-train model, ignore data, set default data, etc.). Our system is based on a unified and generic approach, employing techniques from static analysis, from which data sources can be identified reliably for any…
Taxonomy
Topics: Software System Performance and Reliability · Data Quality and Management · Software Engineering Research
