Data Shapley Valuation for Efficient Batch Active Learning
Amirata Ghorbani, James Zou, Andre Esteva

TL;DR
This paper introduces Active Data Shapley (ADS), a filtering layer that uses data Shapley values to pre-select high-value data points for batch active learning, increasing efficiency by an average factor of 6x while preserving the performance of state-of-the-art batch active learning methods.
Contribution
The paper proposes ADS, a linear-time filtering layer based on data Shapley values that improves the efficiency of batch active learning, especially on noisy, heterogeneous, and domain-shifted datasets.
Findings
ADS increases active learning efficiency by an average factor of 6x.
ADS preserves the effectiveness of state-of-the-art batch active learning methods.
ADS is particularly effective in real-world noisy and heterogeneous data scenarios.
Abstract
Annotating the right set of data amongst all available data points is a key challenge in many machine learning applications. Batch active learning is a popular approach to address this, in which batches of unlabeled data points are selected for annotation, while an underlying learning algorithm gets subsequently updated. Increasingly larger batches are particularly appealing in settings where data can be annotated in parallel, and model training is computationally expensive. A key challenge here is scale - typical active learning methods rely on diversity techniques, which select a diverse set of data points to annotate, from an unlabeled pool. In this work, we introduce Active Data Shapley (ADS) -- a filtering layer for batch active learning that significantly increases the efficiency of active learning by pre-selecting, using a linear time computation, the highest-value points from an…
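The filter-then-select idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each unlabeled point already has an estimated value score (how ADS obtains those scores is not covered in this summary), and it uses a simple greedy k-center rule as a stand-in for whichever diversity-based batch method would run on the filtered pool. The function names `prefilter_top_values`, `greedy_k_center`, and `select_batch`, and the parameters `keep` and `batch_size`, are hypothetical.

```python
import numpy as np


def prefilter_top_values(values, keep):
    """Return indices of the `keep` highest-valued points (in no particular order).

    `values` is assumed to hold a precomputed per-point value estimate;
    this sketch does not compute Shapley values itself.
    """
    values = np.asarray(values, dtype=float)
    keep = min(keep, values.shape[0])
    if keep <= 0:
        return np.empty(0, dtype=int)
    # argpartition performs a partial selection, which is linear in the pool size.
    return np.argpartition(values, -keep)[-keep:]


def greedy_k_center(features, batch_size, seed=0):
    """Stand-in diversity selector: greedily pick the point farthest from those chosen."""
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    batch_size = min(batch_size, n)
    if batch_size <= 0:
        return []
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(n))]
    dist = np.linalg.norm(features - features[chosen[0]], axis=1)
    dist[chosen[0]] = -1.0  # never re-select a chosen point
    while len(chosen) < batch_size:
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        new_dist = np.linalg.norm(features - features[nxt], axis=1)
        dist = np.minimum(dist, new_dist)
        dist[chosen] = -1.0
    return chosen


def select_batch(features, values, keep, batch_size, seed=0):
    """Filter the pool to the top-`keep` values, then pick a diverse batch from it."""
    features = np.asarray(features, dtype=float)
    kept = prefilter_top_values(values, keep)
    local = greedy_k_center(features[kept], batch_size, seed)
    return kept[local]  # indices into the original unlabeled pool
```

The point of the design is that the expensive diversity step only sees `keep` points rather than the full pool, so its cost depends on the size of the filtered set.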
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
