Complexity and Efficient Algorithms for Data Inconsistency Evaluating and Repairing
Dongjing Miao, Zhipeng Cai, Jianzhong Li, Xiangyu Gao, Xianmin Liu

TL;DR
This paper investigates the complexity of data inconsistency evaluation and repair, providing new bounds, approximation algorithms, and a fast estimator for optimal repair size, enhancing data quality management techniques.
Contribution
It improves the understanding of the complexity of optimal subset repair, introduces new approximation algorithms, and develops a sublinear estimator for repair size in data inconsistency evaluation.
Findings
Optimal subset repair is NP-hard and APX-complete.
New approximation algorithms with specific ratios are proposed.
A sublinear estimator for repair size achieves high-probability accuracy.
Abstract
Data inconsistency evaluation and repair are major concerns in data quality management. As a basic computing task, optimal subset repair is not only applied for cost estimation during database repair, but also used directly to derive an evaluation of database inconsistency. Computing an optimal subset repair means finding a minimum set of tuples in an inconsistent database whose removal leaves a consistent subset. Tight bounds on the complexity and efficient algorithms are still unknown. In this paper, we improve the existing complexity and algorithmic results, and we give a fast estimation of the size of an optimal subset repair. We first strengthen the dichotomy for the optimal subset repair computation problem: we show that it is not only APX-complete, but also NP-hard to approximate an optimal subset repair with a factor better than for most cases. We…
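To make the definition of optimal subset repair concrete, here is a minimal Python sketch. It assumes the integrity constraints are functional dependencies, so that every violation involves a pair of tuples. It is not the paper's algorithm: the exhaustive search only illustrates the definition, and the matching-based heuristic is the textbook 2-approximation for vertex cover on the conflict graph. The relation, the FD zip -> city, and all function names are hypothetical and chosen for illustration only.

```python
from itertools import combinations

def conflicts(rows, fds):
    """Return pairs of row indices that jointly violate at least one FD."""
    edges = []
    for i, j in combinations(range(len(rows)), 2):
        for lhs, rhs in fds:
            same_lhs = all(rows[i][a] == rows[j][a] for a in lhs)
            diff_rhs = any(rows[i][a] != rows[j][a] for a in rhs)
            if same_lhs and diff_rhs:
                edges.append((i, j))
                break  # one violated FD is enough to record the pair
    return edges

def optimal_subset_repair(rows, fds):
    """Exhaustive search (exponential time): a smallest set of row indices
    whose removal leaves no conflicting pair."""
    edges = conflicts(rows, fds)
    for k in range(len(rows) + 1):
        for removed in combinations(range(len(rows)), k):
            chosen = set(removed)
            if all(i in chosen or j in chosen for i, j in edges):
                return chosen

def matching_repair(rows, fds):
    """Textbook 2-approximation: remove both endpoints of a greedily
    built maximal matching in the conflict graph."""
    removed = set()
    for i, j in conflicts(rows, fds):
        if i not in removed and j not in removed:
            removed.update((i, j))
    return removed

# Hypothetical relation with the assumed FD zip -> city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "New York"},
    {"zip": "94105", "city": "San Francisco"},
]
fds = [(("zip",), ("city",))]

exact = optimal_subset_repair(rows, fds)
approx = matching_repair(rows, fds)
```

On this toy relation the only conflicts are between row 1 and rows 0 and 2, so removing row 1 alone is an optimal subset repair, while the matching heuristic removes rows 0 and 1. The heuristic never removes more than twice the optimal number of tuples, which gives a simple baseline to compare stronger approximation ratios against.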
Taxonomy
Topics: Cryptography and Data Security · Data Quality and Management · Privacy-Preserving Technologies in Data
