DataExposer: Exposing Disconnect between Data and Systems
Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, Divesh Srivastava

TL;DR
DataExposer is a framework that identifies causally verified data properties causing system malfunctions, enabling targeted debugging of data-driven systems with fewer interventions than previous methods.
Contribution
The paper introduces a causal reasoning approach to debugging data: it identifies the data properties that cause system failures, rather than those that merely correlate with them, improving precision and efficiency over statistical correlation methods.
Findings
Accurately identifies root data causes of system failures
Requires significantly fewer interventions than prior techniques
Effective on real-world and synthetic data-driven systems
Abstract
As data is a central component of many modern systems, the cause of a system malfunction may reside in the data, and, specifically, particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in imperial units (lbs) will malfunction when encountering weight reported in metric units (kilograms). Similar to software debugging, which aims to find bugs in the mechanism (source code or runtime conditions), our goal is to debug the data to identify potential sources of disconnect between the assumptions about the data and the systems that operate on that data. Specifically, we seek which properties of the data cause a data-driven system to malfunction. We propose DataExposer, a framework to identify data properties, called profiles, that are the root causes of performance degradation or failure of a system that…
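The weight-unit example in the abstract can be sketched in a few lines. This is only an illustrative sketch, not DataExposer's implementation: the toy system, the range-based profile, the 90 to 400 lbs bounds, the 100 lbs threshold, and all function names are assumptions made for this example. It shows the general idea of checking a candidate profile, intervening on the data to repair it, and observing whether the malfunction goes away.

```python
KG_TO_LB = 2.20462  # standard kilogram-to-pound conversion factor


def malfunction_score(weights):
    """Hypothetical system: assumes weights are in lbs and flags anyone
    under 100 as underweight. The score is the fraction flagged."""
    return sum(w < 100 for w in weights) / len(weights)


def profile_violation(weights, low=90.0, high=400.0):
    """Hypothetical profile: values fall in a plausible range for lbs.
    Returns the fraction of values outside that range."""
    return sum(not (low <= w <= high) for w in weights) / len(weights)


def intervene(weights, low=90.0):
    """Hypothetical intervention: treat implausibly low values as
    kilograms and convert them to pounds."""
    return [w * KG_TO_LB if w < low else w for w in weights]


# Made-up input where weight was reported in kilograms.
failing = [60.0, 72.5, 68.0, 55.0]

before = malfunction_score(failing)
repaired = intervene(failing)
after = malfunction_score(repaired)

# If repairing the profile removes the malfunction, the profile is
# supported as a cause rather than a mere correlate.
profile_is_cause = profile_violation(failing) > 0 and after < before
```

Here the intervention acts as the causal test: the profile violation is only reported as a cause because altering the data to satisfy the profile changes the system's behavior.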
Taxonomy
Topics: Software System Performance and Reliability · Scientific Computing and Data Management · Data Quality and Management
