Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization

Adel Ardalan; Derek Paulsen; Amanpreet Singh Saini; Walter Cai; AnHai Doan

arXiv:2101.05308 · cs.DB · January 15, 2021

TL;DR

This paper explores the problem of cleaning data to a target accuracy, focusing on value normalization (VN), and proposes an approach that addresses how users verify and clean clusters, how long that takes, and how multiple users can collaborate on it.

Contribution

It introduces the problem of data cleaning with a target accuracy and presents a new solution template for value normalization that addresses verification and collaboration challenges.

Findings

1. Identifies limitations of current industry practices for value normalization.

2. Proposes a novel framework for achieving target accuracy in data cleaning.

3. Lays groundwork for future research in collaborative and efficient data cleaning methods.

Abstract

Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings, then asking a user to verify and clean the clusters until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. This part also often takes a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by…
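To make the baseline workflow in the abstract concrete, here is a minimal sketch of the automatic clustering step followed by replacement with a unique string. It is not the paper's method: the function names (`similarity`, `cluster_strings`, `normalize`), the greedy clustering strategy, the 0.8 threshold, and the choice of the first cluster member as the canonical string are all assumptions made for illustration. The human verify-and-clean step, which is what the paper targets, is not modeled.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (standard-library difflib)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_strings(values, threshold=0.8):
    """Greedy single-pass clustering (an illustrative assumption).

    Each string joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for v in values:
        for c in clusters:
            if similarity(v, c[0]) >= threshold:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters


def normalize(values, clusters):
    """Replace every string with the first member of its cluster."""
    canonical = {v: c[0] for c in clusters for v in c}
    return [canonical[v] for v in values]


if __name__ == "__main__":
    raw = ["Madison", "madison", "Madisonn", "Chicago"]
    clusters = cluster_strings(raw)
    print(clusters)
    print(normalize(raw, clusters))
```

Automatic clusters like these can be wrong in both directions (merging distinct entities or splitting one entity), which is why the workflow described in the abstract relies on a user to verify and clean them before 100% accuracy can be claimed.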

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Data Mining Algorithms and Applications