Distance-based Data Cleaning: A Survey (Technical Report)
Yu Sun, Jian Zhang

TL;DR
This survey reviews distance-based data cleaning methods, which use similarity measures to improve data quality when data are sparse or heterogeneous, and classifies them by task, such as error detection and data repair.
Contribution
It provides a comprehensive classification and review of distance-based data cleaning techniques across four main tasks, highlighting their importance and potential for future research.
Findings
Distance-based methods effectively handle data heterogeneity.
Similarity neighbors improve error detection accuracy.
Distance relationships guide data repair processes.
Abstract
With the rapid development of internet technology, dirty data are commonly observed in real-world scenarios, e.g., owing to unreliable sensor readings, transmission, and collection from heterogeneous sources. To mitigate their negative effects on downstream applications, data cleaning approaches preprocess the dirty data before those applications run. Most data cleaning methods identify or correct dirty data by referring to the values of neighbors that share the same information. Unfortunately, owing to data sparsity and heterogeneity, the number of neighbors under an equality relationship is rather limited, especially when data values exhibit variance. To tackle this problem, distance-based data cleaning approaches consider similarity neighbors defined by value distance. By tolerating small variations, the enriched…
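The contrast between equality neighbors and similarity neighbors can be illustrated with a minimal sketch. This is not a method taken from the survey: the function names, the absolute-difference distance, the threshold `eps`, the `min_neighbors` rule, and the sample readings are all assumptions chosen for illustration.

```python
from typing import List


def equality_neighbors(values: List[float], i: int) -> List[int]:
    """Indices of other values exactly equal to values[i]."""
    return [j for j, v in enumerate(values) if j != i and v == values[i]]


def similarity_neighbors(values: List[float], i: int, eps: float) -> List[int]:
    """Indices of other values within distance eps of values[i].

    Assumption: distance is the absolute difference of two numbers.
    """
    return [j for j, v in enumerate(values) if j != i and abs(v - values[i]) <= eps]


def detect_errors(values: List[float], eps: float, min_neighbors: int) -> List[int]:
    """Flag indices that have fewer than min_neighbors similarity neighbors.

    Assumption: a value with too few close neighbors is treated as suspicious.
    """
    return [
        i
        for i in range(len(values))
        if len(similarity_neighbors(values, i, eps)) < min_neighbors
    ]


# Hypothetical sensor readings with small variations and one gross error.
readings = [20.1, 20.3, 19.9, 20.2, 35.0, 20.0]

print(equality_neighbors(readings, 0))         # no exact matches exist
print(similarity_neighbors(readings, 0, 0.5))  # close readings become neighbors
print(detect_errors(readings, 0.5, 2))         # only the isolated reading is flagged
```

With exact equality, no reading has any neighbor, so nothing can be compared. Tolerating a small distance gives each normal reading several neighbors, which leaves the isolated value as the only one flagged.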
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Digital and Cyber Forensics
