View-Driven Deduplication with Active Learning
Kristi Morton, Hannaneh Hajishirzi, Magdalena Balazinska, Dan Grossman

TL;DR
This paper introduces a view-driven deduplication method that leverages active learning to efficiently improve data quality in visual analytics, requiring fewer labels and producing cleaner visualizations.
Contribution
The paper proposes a record deduplication approach that weighs the impact individual tuples have on a visualization, aiming for the cleanest possible view under a limited labeling budget in visual analytics systems.
Findings
Produces significantly cleaner views with fewer labels
Outperforms state-of-the-art deduplication methods
Reduces labeling effort in data cleaning
Abstract
Visual analytics systems such as Tableau are increasingly popular for interactive data exploration. These tools, however, do not currently assist users with detecting or resolving potential data quality problems including the well-known deduplication problem. Recent approaches for deduplication focus on cleaning entire datasets and commonly require hundreds to thousands of user labels. In this paper, we address the problem of deduplication in the context of visual data analytics. We present a new approach for record deduplication that strives to produce the cleanest view possible with a limited budget for data labeling. The key idea behind our approach is to consider the impact that individual tuples have on a visualization and to monitor how the view changes during cleaning. With experiments on nine different visualizations for two real-world datasets, we show that our approach…
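To make the key idea concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a toy bar-chart view (SUM of a value grouped by name), an L1 distance between views, and a hand-supplied list of candidate duplicate pairs. It ranks the pairs by how much merging each one would change the view, so that a small labeling budget is spent on the pairs that matter most to the visualization. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def bar_view(records):
    """Build a toy view: SUM(value) GROUP BY name, as a dict."""
    totals = defaultdict(float)
    for name, value in records:
        totals[name] += value
    return dict(totals)


def view_distance(a, b):
    """L1 distance between two bar charts keyed by group name (an assumed metric)."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


def merge(records, keep, drop):
    """Relabel every record named `drop` as `keep`, as if they were one entity."""
    return [(keep if name == drop else name, value) for name, value in records]


def rank_pairs_by_view_impact(records, candidate_pairs):
    """Score each candidate pair by how far merging it would move the view."""
    base = bar_view(records)
    scored = []
    for keep, drop in candidate_pairs:
        impact = view_distance(base, bar_view(merge(records, keep, drop)))
        scored.append((impact, keep, drop))
    # Highest-impact pairs first: these are the ones worth asking a user to label.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Hypothetical data: two suspected duplicate pairs with very different weight in the view.
records = [("IBM", 100.0), ("I.B.M.", 40.0), ("Acme", 5.0), ("ACME", 1.0)]
pairs = [("Acme", "ACME"), ("IBM", "I.B.M.")]

ranked = rank_pairs_by_view_impact(records, pairs)
budget = 1  # number of labels the user is willing to provide
to_label = ranked[:budget]
print(to_label)
```

With a budget of one label, the sketch picks the IBM pair, because merging it shifts the chart far more than merging the Acme pair. A dataset-wide cleaner would have no reason to prefer one pair over the other.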
Taxonomy
Topics: Data Quality and Management · Cryptography and Data Security · Advanced Data Storage Technologies
