ED2: Two-stage Active Learning for Error Detection -- Technical Report
Felix Neutatz, Mohammad Mahdavi, Ziawasch Abedjan

TL;DR
ED2 introduces a semi-supervised, active learning approach with novel sampling and features for error detection, achieving high accuracy with minimal labeled data on real datasets.
Contribution
The paper presents ED2, a new two-stage active learning method with multi-column features that improves error detection efficiency and accuracy.
Findings
ED2 requires labels for less than 1% of the data to outperform existing methods
Fast convergence of classification with high detection accuracy
Effective on multiple real-world datasets
Abstract
Traditional error detection approaches require user-defined parameters and rules. Thus, the user has to know both the error detection system and the data. However, we can also formulate error detection as a semi-supervised classification problem that only requires domain expertise. The challenges for such an approach are twofold: (1) to represent the data in a way that enables a classification model to identify various kinds of data errors, and (2) to pick the most promising data values for learning. In this paper, we address these challenges with ED2, our new example-driven error detection method. First, we present a new two-dimensional multi-classifier sampling strategy for active learning. Second, we propose novel multi-column features. The combined application of these techniques provides fast convergence of the classification task with high detection accuracy. On several real-world…
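The abstract names a "two-dimensional" sampling strategy without spelling out the procedure, so the following is only an illustrative sketch of one way such a strategy could be read, not the authors' algorithm. It assumes one binary classifier per column that outputs an error probability for each cell, first picks the column whose classifier is least certain on average, and then picks the unlabeled cells in that column whose predictions are closest to 0.5. All names (`uncertainty`, `pick_column`, `pick_cells`, `next_batch`) and the probability values are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def uncertainty(p: float) -> float:
    """Uncertainty of a binary prediction: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p - 1.0)


def pick_column(probs: Dict[str, List[float]]) -> str:
    """Dimension 1 (assumption): the column whose classifier is least sure on average."""
    return max(
        probs,
        key=lambda col: sum(uncertainty(p) for p in probs[col]) / len(probs[col]),
    )


def pick_cells(column_probs: List[float], labeled: Set[int], k: int) -> List[int]:
    """Dimension 2 (assumption): the k unlabeled rows with the most uncertain predictions."""
    candidates = [i for i in range(len(column_probs)) if i not in labeled]
    candidates.sort(key=lambda i: uncertainty(column_probs[i]), reverse=True)
    return candidates[:k]


def next_batch(
    probs: Dict[str, List[float]], labeled: Dict[str, Set[int]], k: int = 2
) -> Tuple[str, List[int]]:
    """Return the column and row indices to show to the user for labeling."""
    col = pick_column(probs)
    return col, pick_cells(probs[col], labeled.get(col, set()), k)


# Hypothetical per-cell error probabilities from two per-column classifiers.
probs = {
    "city": [0.05, 0.95, 0.10, 0.02],  # confident predictions
    "zip": [0.45, 0.60, 0.90, 0.52],   # mostly uncertain predictions
}
labeled = {"city": {0}, "zip": {3}}    # rows the user has already labeled

column, rows = next_batch(probs, labeled, k=2)
print(column, rows)
```

In a full active learning loop, the user would label the returned cells as erroneous or clean, the classifier for that column would be retrained, and the selection would repeat until the labeling budget is spent.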
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Imbalanced Data Classification Techniques
