A Data Source Dependency Analysis Framework for Large Scale Data Science Projects

Laurent Boué; Pratap Kunireddy; Pavle Subotić

arXiv:2212.07951·cs.SE·December 16, 2022

TL;DR

This paper introduces an automated framework for mapping data source dependencies in large ML projects, helping MLOps engineers monitor and mitigate data-related issues proactively.

Contribution

It presents a unified static analysis-based system for identifying data source dependencies across various languages, integrated as a REST API for practical deployment.

Findings

01 Implemented and used by Microsoft MLOps engineers

02 Reliable identification of data sources across multiple languages

03 Facilitates proactive data dependency management

Abstract

Dependency hell is a well-known pain point in the development of large software projects, and machine learning (ML) code bases are not immune to it. In fact, ML applications suffer from an additional form, namely "data source dependency hell". This term refers to the central role played by data and its unique quirks, which often lead to unexpected failures of ML models that cannot be explained by code changes. In this paper, we present an automated dependency mapping framework that allows MLOps engineers to monitor the whole dependency map of their models in a fast-paced engineering environment and thus mitigate ahead of time the consequences of any data source changes (e.g., retrain the model, ignore the data, or set default data). Our system is based on a unified and generic approach, employing techniques from static analysis, from which data sources can be identified reliably for any…

Taxonomy

Topics: Software System Performance and Reliability · Data Quality and Management · Software Engineering Research