Moving Fast With Broken Data
Shreya Shankar, Labib Fawaz, Karl Gyllstrom, Aditya G. Parameswaran

TL;DR
This paper introduces an automatic data validation system for ML pipelines at Meta, using partition summaries and a novel method called GATE to effectively detect corrupted data partitions, improving model reliability.
Contribution
The paper presents a new Partition Summarization approach and GATE method for high-precision data validation in large-scale ML pipelines, addressing the challenge of detecting data corruption.
Findings
GATE achieved 2.1x higher precision than baseline methods.
Partition Summarization enables flexible data validation techniques.
Lessons learned from deployment inform best practices for production ML pipelines.
Abstract
Machine learning (ML) models in production pipelines are frequently retrained on the latest partitions of large, continually-growing datasets. Due to engineering bugs, partitions in such datasets almost always have some corrupted features; thus, it's critical to detect data issues and block retraining before downstream ML model accuracy decreases. However, it's difficult to identify when a partition is corrupted enough to block retraining. Blocking too often yields stale model snapshots in production; blocking too little yields broken model snapshots in production. In this paper, we present an automatic data validation system for ML pipelines implemented at Meta. We employ what we call a Partition Summarization (PS) approach to data validation: each timestamp-based partition of data is summarized with data quality metrics, and summaries are compared to detect corrupted partitions. We…
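To make the Partition Summarization idea concrete, here is a minimal sketch of the general pattern the abstract describes: summarize each partition with data quality metrics, then compare the newest summary against historical ones. It is not the paper's GATE method. The metric names (completeness, mean), the z-score comparison, the threshold of 3.0, and the function names `summarize` and `anomalous_metrics` are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def summarize(partition):
    """Summarize one partition (a list of row dicts) with per-feature metrics.

    The two metrics here (completeness and mean) are illustrative choices,
    not the metric set used in the paper.
    """
    features = sorted({key for row in partition for key in row})
    summary = {}
    for feature in features:
        values = [row.get(feature) for row in partition]
        present = [v for v in values if v is not None]
        summary[feature + ".completeness"] = len(present) / len(values)
        summary[feature + ".mean"] = mean(present) if present else 0.0
    return summary


def anomalous_metrics(history, current, z_threshold=3.0):
    """Return the metrics in `current` that deviate from past summaries.

    Uses a plain z-score per metric (an assumed stand-in, not GATE).
    """
    flagged = []
    for metric, value in current.items():
        past = [s[metric] for s in history if metric in s]
        if len(past) < 2:
            continue  # not enough history to judge this metric
        mu, sigma = mean(past), pstdev(past)
        if sigma == 0:
            # Metric was constant historically: any change is suspicious.
            if value != mu:
                flagged.append(metric)
        elif abs(value - mu) / sigma > z_threshold:
            flagged.append(metric)
    return flagged


# Five clean daily partitions with a hypothetical "clicks" feature.
clean_days = [
    [{"clicks": 10.0}, {"clicks": 12.0}, {"clicks": 11.0 + day * 0.1}]
    for day in range(5)
]
history = [summarize(p) for p in clean_days]

# A new partition where an upstream bug nulled out most values.
corrupted = [{"clicks": None}, {"clicks": 12.0}, {"clicks": None}]
healthy = [{"clicks": 10.0}, {"clicks": 12.0}, {"clicks": 11.2}]

flagged_corrupted = anomalous_metrics(history, summarize(corrupted))
flagged_healthy = anomalous_metrics(history, summarize(healthy))
```

A pipeline could block retraining whenever the flagged list is non-empty. The threshold carries the trade-off the abstract names: a lower value blocks more often and risks stale models, while a higher value blocks less often and risks training on broken data.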
Taxonomy
Topics: Machine Learning and Data Classification · Data Quality and Management · Data Stream Mining Techniques
