Automatic Data Repair: Are We Ready to Deploy?
Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Jianwei Yin

TL;DR
This paper systematically evaluates 12 data repair algorithms across various error scenarios and proposes a unified optimization strategy, providing practical guidelines and insights for deploying data repair in real-world applications.
Contribution
It offers a comprehensive comparison and taxonomy of data repair algorithms, introduces a novel evaluation metric, and presents a unified repair optimization strategy with practical deployment guidelines.
Findings
Data repair improves analysis performance regardless of error rate.
Purely clean data does not always yield optimal results.
The unified repair strategy significantly enhances existing methods.
Abstract
Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from generative models. The study of repairing erroneous data has therefore gained significant importance. Existing data repair algorithms differ in the information they use and in their problem settings, and they have been tested only in limited scenarios. In this paper, we first compare and summarize these algorithms using a new guided information-based taxonomy. We then systematically conduct a comprehensive evaluation of 12 mainstream data repair algorithms under the settings of various data error rates, error types, and downstream analysis tasks, assessing their error reduction performance with a novel metric. Also, we develop an effective and unified repair optimization strategy that…
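The abstract does not define the paper's novel metric, so the sketch below does not attempt to reproduce it. As a point of reference only, it shows the conventional cell-level precision, recall, and F1 often used to score a repair against ground truth. The function name `repair_scores` and the toy tables are assumptions made for illustration and do not come from the paper.

```python
from typing import Dict, List, Sequence


def repair_scores(
    dirty: Sequence[Sequence[str]],
    repaired: Sequence[Sequence[str]],
    clean: Sequence[Sequence[str]],
) -> Dict[str, float]:
    """Cell-level precision/recall/F1 of a repair (hypothetical helper).

    precision = correctly repaired cells / cells the repair changed
    recall    = correctly repaired cells / cells that were erroneous
    """
    changed = 0  # cells the repair modified
    correct = 0  # modified cells that now equal the ground truth
    errors = 0   # cells that were wrong in the dirty table

    for d_row, r_row, c_row in zip(dirty, repaired, clean):
        for d, r, c in zip(d_row, r_row, c_row):
            if d != c:
                errors += 1
            if r != d:
                changed += 1
                if r == c:
                    correct += 1

    precision = correct / changed if changed else 0.0
    recall = correct / errors if errors else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (made-up data): two injected errors, one fixed,
# one left alone, and one clean cell wrongly modified.
clean: List[List[str]] = [["a", "1"], ["b", "2"], ["c", "3"]]
dirty: List[List[str]] = [["a", "9"], ["x", "2"], ["c", "3"]]
repaired: List[List[str]] = [["a", "1"], ["x", "2"], ["c", "4"]]

scores = repair_scores(dirty, repaired, clean)
print(scores)
```

Precision penalizes a repair that modifies cells which were already correct, while recall penalizes errors left in place. Reporting both makes it visible when an aggressive repair introduces new errors.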
Taxonomy
Topics: Data Quality and Management · Big Data and Business Intelligence · Privacy-Preserving Technologies in Data
