Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans
Stefan Grafberger

TL;DR
This paper introduces a method to extract, instrument, and analyze ML pipelines from code using logical query plans, enabling automated validation, monitoring, and advanced analysis without manual code modifications.
Contribution
It presents a novel approach to automatically infer, instrument, and rewrite ML pipelines from Python code using logical query plans, facilitating automated analysis and optimization.
Findings
Efficient extraction of pipeline semantics from Python code.
Lightweight provenance tracking for data issues.
Automated pipeline rewriting for advanced analyses.
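To make the provenance finding concrete, here is a minimal sketch of row-level provenance tracking through two relational operators. It is an illustration only, not the paper's implementation: the `Row` class, the `source`, `select`, and `join` helpers, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Row:
    values: dict
    provenance: frozenset  # (source name, row index) pairs this row derives from


def source(name, records):
    """Tag every input record with an identifier of its origin."""
    return [Row(dict(r), frozenset({(name, i)})) for i, r in enumerate(records)]


def select(rows, predicate):
    """Selection keeps matching rows; their provenance is unchanged."""
    return [r for r in rows if predicate(r.values)]


def join(left, right, key):
    """Equi-join; each output row inherits the provenance of both inputs."""
    out = []
    for l in left:
        for r in right:
            if l.values[key] == r.values[key]:
                out.append(Row({**l.values, **r.values}, l.provenance | r.provenance))
    return out


# Hypothetical toy data: one patient record has a missing age.
patients = source("patients", [{"id": 1, "age": 34}, {"id": 2, "age": None}])
labels = source("labels", [{"id": 1, "label": 0}, {"id": 2, "label": 1}])

joined = join(patients, labels, "id")
clean = select(joined, lambda v: v["age"] is not None)

# Input rows that no longer contribute to the pipeline output after filtering.
dropped = {p for r in joined for p in r.provenance} - {p for r in clean for p in r.provenance}
```

Because each output row carries the identifiers of the inputs it came from, a data issue seen downstream (here, rows lost to a filter) can be traced back to specific source records.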
Abstract
Machine Learning (ML) is increasingly used to automate impactful decisions, which leads to concerns regarding their correctness, reliability, and fairness. We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their ML pipelines. In contrast to existing work, our key idea is to extract "logical query plans" from ML pipeline code relying on popular libraries. Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code. First, we developed such an abstract ML pipeline representation together with machinery to extract it from Python code. Next, we used this representation to efficiently instrument static ML pipelines and apply provenance tracking, which enables…
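As a rough illustration of what mapping pipeline code to logical operators can look like, the sketch below statically scans Python source with the standard `ast` module and labels recognised library calls with an operator type. This is an assumption-laden toy, not the extraction machinery described in the abstract: the `OPERATOR_BY_METHOD` table, the operator names, and the sample pipeline are hypothetical, and the actual approach may work quite differently (for example, at runtime rather than statically).

```python
import ast

# Hypothetical mapping from library method names to logical operator types.
OPERATOR_BY_METHOD = {
    "read_csv": "DataSource",
    "merge": "Join",
    "dropna": "Selection",
    "fit": "Estimator",
}


def extract_operators(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, method name, operator type) for each recognised call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            method = node.func.attr
            if method in OPERATOR_BY_METHOD:
                found.append((node.lineno, method, OPERATOR_BY_METHOD[method]))
    return sorted(found)  # order operators by source line


# Hypothetical pipeline code; it is only parsed, never executed.
PIPELINE = '''
patients = pd.read_csv("patients.csv")
histories = pd.read_csv("histories.csv")
data = patients.merge(histories, on="ssn")
data = data.dropna()
model.fit(data[["age"]], data["label"])
'''

plan = extract_operators(PIPELINE)
for lineno, method, operator in plan:
    print(f"line {lineno}: {method} -> {operator}")
```

The sketch recovers only a flat, ordered list of operators. A full logical query plan would also need the data-flow edges between them, which this example does not attempt to infer.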
Taxonomy
Topics: Software Testing and Debugging Techniques · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization
