
TL;DR
This paper proposes Seafaring, a method for active learning that draws directly on the vast, unstructured pool of web data rather than a small task-specific pool. Demonstrated on a large-scale image collection, it improves label efficiency over traditional small pools.
Contribution
It introduces Seafaring, a user-side retrieval algorithm for active learning from extremely large web data pools without task-specific pool construction.
Findings
Seafaring outperforms existing methods on large-scale web data.
The method effectively retrieves informative data from over ten billion images.
Active learning from web data reduces labeling costs significantly.
Abstract
Labeling data is one of the most costly processes in machine learning pipelines. Active learning is a standard approach to alleviating this problem. Pool-based active learning first builds a pool of unlabelled data and iteratively selects data to be labeled so that the total number of required labels is minimized, keeping the model performance high. Many effective criteria for choosing data from the pool have been proposed in the literature. However, how to build the pool is less explored. Specifically, most of the methods assume that a task-specific pool is given for free. In this paper, we advocate that such a task-specific pool is not always available and propose the use of a myriad of unlabelled data on the Web for the pool for which active learning is applied. As the pool is extremely large, it is likely that relevant data exist in the pool for many tasks, and we do not need to…
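The pool-based loop the abstract describes (train on the labeled set, score the pool, send the most informative item for labeling, repeat) can be sketched as follows. This is a minimal generic illustration, not Seafaring's retrieval algorithm: the nearest-centroid classifier, the entropy criterion, and every name here (`active_learning_loop`, `oracle`, `seed_idx`, `budget`) are assumptions chosen for brevity, and the pool is a small in-memory array rather than the web.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of each row of class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def fit_centroids(X, y, n_classes):
    """Toy stand-in classifier: one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    """Softmax over negative squared distances to each centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def active_learning_loop(X_pool, oracle, seed_idx, n_classes, budget):
    """Hypothetical pool-based loop with entropy (uncertainty) sampling.

    seed_idx must contain at least one example of every class so that
    each centroid is defined. oracle(i) returns the label of pool item i.
    """
    labeled = list(seed_idx)
    labels = {i: oracle(i) for i in labeled}
    for _ in range(budget):
        X_l = X_pool[labeled]
        y_l = np.array([labels[i] for i in labeled])
        centroids = fit_centroids(X_l, y_l, n_classes)
        scores = entropy(predict_proba(centroids, X_pool))
        scores[labeled] = -np.inf  # never re-select an already labeled item
        pick = int(np.argmax(scores))  # most uncertain unlabeled item
        labels[pick] = oracle(pick)    # one label request per round
        labeled.append(pick)
    return labeled, labels
```

Each round scores the entire pool, which is affordable only for a small pool; with a pool as large as the one the abstract proposes, that exhaustive scoring step is what has to be replaced by retrieval.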
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Advanced Bandit Algorithms Research
