Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans

Stefan Grafberger

arXiv:2407.07560·cs.DB·September 4, 2024

Open Access

TL;DR

This paper introduces a method to extract, instrument, and analyze ML pipelines from code using logical query plans, enabling automated validation, monitoring, and advanced analysis without manual code modifications.

Contribution

It presents a novel approach to automatically infer, instrument, and rewrite ML pipelines from Python code using logical query plans, facilitating automated analysis and optimization.
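To make the idea of a "logical query plan" for pipeline code concrete, here is a minimal sketch that turns a small pandas-style script into a list of relational-style operators. It is an illustration only and not necessarily the mechanism the paper uses: it relies on static parsing with Python's standard `ast` module as a stand-in, and the `OPERATOR_KINDS` mapping, the `extract_plan` helper, and the CSV file names are all hypothetical.

```python
import ast

# Hypothetical mapping from library method names to relational-style operators.
OPERATOR_KINDS = {
    "read_csv": "Source",
    "merge": "Join",
    "dropna": "Selection",
}


def extract_plan(source):
    """Return a list of (output_variable, operator, input_variables) tuples."""
    plan = []
    known = set()  # variables produced by an earlier operator
    for stmt in ast.parse(source).body:
        # Only handle simple statements of the form `name = obj.method(...)`.
        if not (isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call)):
            continue
        target = stmt.targets[0]
        func = stmt.value.func
        if not (isinstance(target, ast.Name) and isinstance(func, ast.Attribute)):
            continue
        operator = OPERATOR_KINDS.get(func.attr)
        if operator is None:
            continue
        # Inputs are earlier pipeline variables referenced anywhere in the call.
        inputs = sorted({
            node.id
            for node in ast.walk(stmt.value)
            if isinstance(node, ast.Name) and node.id in known
        })
        plan.append((target.id, operator, inputs))
        known.add(target.id)
    return plan


# The pipeline is only parsed, never executed, so the files need not exist.
PIPELINE = """
import pandas as pd
patients = pd.read_csv("patients.csv")
histories = pd.read_csv("histories.csv")
data = patients.merge(histories, on="ssn")
data = data.dropna()
"""

plan = extract_plan(PIPELINE)
for output, operator, inputs in plan:
    print(f"{output} <- {operator}({', '.join(inputs)})")
```

Each tuple names the variable an operator produces and the earlier variables it consumes, which is enough to draw the pipeline as a dataflow graph. A real extractor would need to handle far more than single-assignment statements.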

Findings

1. Efficient extraction of pipeline semantics from Python code.
2. Lightweight provenance tracking for data issues.
3. Automated pipeline rewriting for advanced analyses.

Abstract

Machine Learning (ML) is increasingly used to automate impactful decisions, which leads to concerns regarding their correctness, reliability, and fairness. We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their ML pipelines. In contrast to existing work, our key idea is to extract "logical query plans" from ML pipeline code relying on popular libraries. Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code. First, we developed such an abstract ML pipeline representation together with machinery to extract it from Python code. Next, we used this representation to efficiently instrument static ML pipelines and apply provenance tracking, which enables…
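The provenance tracking mentioned in the abstract can be pictured with a small, self-contained sketch. It is not the paper's implementation: it assumes a simple scheme in which every input row is tagged with a `(source, row index)` identifier and operators propagate those tags. The helper names `with_provenance`, `select`, and `join`, as well as the toy data, are hypothetical.

```python
def with_provenance(rows, source_name):
    """Tag each row with a provenance set holding one (source, index) identifier."""
    return [(row, frozenset({(source_name, i)})) for i, row in enumerate(rows)]


def select(tagged_rows, predicate):
    """Selection keeps the provenance of every surviving row unchanged."""
    return [(row, prov) for row, prov in tagged_rows if predicate(row)]


def join(left, right, key):
    """Equi-join; each output row carries the union of both inputs' provenance."""
    index = {}
    for row, prov in right:
        index.setdefault(row[key], []).append((row, prov))
    return [
        ({**l_row, **r_row}, l_prov | r_prov)
        for l_row, l_prov in left
        for r_row, r_prov in index.get(l_row[key], [])
    ]


patients = with_provenance(
    [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 51}],
    "patients",
)
labels = with_provenance(
    [{"id": 1, "label": 0}, {"id": 3, "label": 1}],
    "labels",
)

joined = join(patients, labels, "id")
clean = select(joined, lambda row: row["age"] is not None)

# Trace back: which input patient rows never reach the final output?
surviving = set().union(*(prov for _, prov in clean))
dropped = {("patients", i) for i in range(len(patients))} - surviving
print(sorted(dropped))
```

Because every output row records the input rows it derives from, a question such as "which patients were silently lost?" can be answered after the fact without modifying the pipeline logic itself.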

Taxonomy

Topics: Software Testing and Debugging Techniques · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization