View-Driven Deduplication with Active Learning

Kristi Morton; Hannaneh Hajishirzi; Magdalena Balazinska; Dan Grossman

arXiv:1606.05708 · cs.DB · June 21, 2016 · 1 citation

Open Access

TL;DR

This paper introduces a view-driven deduplication method that leverages active learning to efficiently improve data quality in visual analytics, requiring fewer labels and producing cleaner visualizations.

Contribution

It proposes a novel approach that considers the impact of data tuples on visualizations, optimizing deduplication with limited labeling budgets in visual analytics systems.

Findings

01 Produces significantly cleaner views with fewer labels

02 Outperforms state-of-the-art deduplication methods

03 Reduces labeling effort in data cleaning

Abstract

Visual analytics systems such as Tableau are increasingly popular for interactive data exploration. These tools, however, do not currently assist users with detecting or resolving potential data quality problems including the well-known deduplication problem. Recent approaches for deduplication focus on cleaning entire datasets and commonly require hundreds to thousands of user labels. In this paper, we address the problem of deduplication in the context of visual data analytics. We present a new approach for record deduplication that strives to produce the cleanest view possible with a limited budget for data labeling. The key idea behind our approach is to consider the impact that individual tuples have on a visualization and to monitor how the view changes during cleaning. With experiments on nine different visualizations for two real-world datasets, we show that our approach…
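
To make the key idea concrete, here is a minimal Python sketch of view-aware label prioritization. It is not the authors' algorithm, only an illustration under stated assumptions: the view is a bar chart (sum of a measure per category), candidate duplicate pairs arrive with an estimated duplicate probability, and a pair's "impact" is the L1 change in the view if one tuple of the pair were removed. The names `bar_view`, `view_distance`, and `rank_pairs_by_view_impact`, along with the sample data, are hypothetical.

```python
from collections import defaultdict


def bar_view(rows):
    """Aggregate rows into a bar-chart view: category -> summed measure."""
    view = defaultdict(float)
    for row in rows.values():
        view[row["category"]] += row["value"]
    return dict(view)


def view_distance(a, b):
    """L1 distance between two views over the union of their categories."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


def rank_pairs_by_view_impact(rows, candidate_pairs):
    """Order candidate pairs by (duplicate probability x view change if merged).

    Assumption: merging a pair is modeled as dropping the second tuple.
    """
    base = bar_view(rows)
    scored = []
    for keep_id, drop_id, dup_prob in candidate_pairs:
        merged = {rid: r for rid, r in rows.items() if rid != drop_id}
        impact = view_distance(base, bar_view(merged))
        scored.append((dup_prob * impact, keep_id, drop_id))
    return sorted(scored, reverse=True)


# Hypothetical data: two suspected duplicate pairs with very different view impact.
rows = {
    1: {"category": "WA", "value": 100.0},
    2: {"category": "WA", "value": 100.0},
    3: {"category": "OR", "value": 5.0},
    4: {"category": "OR", "value": 5.0},
}
candidate_pairs = [(1, 2, 0.6), (3, 4, 0.9)]  # (keep_id, drop_id, dup_prob)

budget = 1  # number of labels the user is willing to provide
to_label = rank_pairs_by_view_impact(rows, candidate_pairs)[:budget]
```

In this toy example the pair (1, 2) is queried first even though the matcher is less sure about it, because resolving it would change the visualization far more than resolving (3, 4). A cleaner that ignores the view would spend the single available label on the higher-probability but low-impact pair.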

Taxonomy

Topics: Data Quality and Management · Cryptography and Data Security · Advanced Data Storage Technologies