Learning Dependency Models for Subset Repair
Haoda Li, Jiahui Chen, Yu Sun, Shaoxu Song, Haiwei Zhang, Xiaojie Yuan

TL;DR
This paper introduces new methods for subset repair in data with inconsistencies by leveraging attribute dependencies, including exact, approximate, and probabilistic algorithms validated on real datasets.
Contribution
It formalizes the optimal subset repair problem considering attribute dependencies, analyzes its complexity, and proposes multiple algorithms with theoretical guarantees.
Findings
Exact solutions via integer linear programming
Effective approximate algorithms with performance guarantees
Probabilistic approach with efficiency and approximation bounds
Abstract
Inconsistent values are commonly encountered in real-world applications, which can negatively impact data analysis and decision-making. While existing research primarily focuses on identifying the smallest removal set to resolve inconsistencies, recent studies have shown that multiple minimum removal sets may exist, making it difficult to make further decisions. Although some approaches use the most frequent values as guidance for subset repair, this strategy has been criticized for its potential to inaccurately identify errors. To address these issues, we consider the dependencies between attribute values to determine a more appropriate subset repair. Our main contributions include (1) formalizing the optimal subset repair problem with attribute dependencies and analyzing its computational hardness; (2) computing the exact solution using integer linear programming; (3) developing…
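The abstract's point that several minimum removal sets can coexist is easy to see on a toy example. The sketch below is illustrative only and is not the paper's algorithm: the relation, the `zip_code -> city` constraint, and the function names `conflicts` and `minimum_removal_sets` are all assumptions introduced here, and the search is plain brute force rather than the integer linear programming or approximation methods the paper proposes.

```python
from itertools import combinations

# Hypothetical toy relation of (zip_code, city) rows.
# Assumed integrity constraint: zip_code determines city.
rows = [
    ("10001", "New York"),
    ("10001", "Newark"),
    ("94105", "San Francisco"),
]


def conflicts(rows):
    """Return index pairs of rows that share a zip_code but disagree on city."""
    return [
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if rows[i][0] == rows[j][0] and rows[i][1] != rows[j][1]
    ]


def minimum_removal_sets(rows):
    """Brute-force every smallest set of row indices whose removal
    leaves no conflicting pair. Exponential; suitable for toy inputs only."""
    pairs = conflicts(rows)
    for size in range(len(rows) + 1):
        found = [
            set(removed)
            for removed in combinations(range(len(rows)), size)
            if all(i in removed or j in removed for i, j in pairs)
        ]
        if found:
            return found
    return []


repairs = minimum_removal_sets(rows)
print(repairs)  # two equally small repairs: drop row 0 or drop row 1
```

Both repairs remove exactly one row, so cardinality alone cannot say whether "New York" or "Newark" is the error. This is the ambiguity that motivates using additional signals, such as the dependencies between attribute values, to choose among minimum removal sets.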
Taxonomy
Topics: Imbalanced Data Classification Techniques · Machine Learning and Data Classification · Data Stream Mining Techniques
