A Data Management Approach for Dataset Selection Using Human Computation
Alexandros Ntoulas, Omar Alonso, Vasilis Kandylas

TL;DR
This paper presents a methodology and prototype system for choosing which instances to send for human annotation in Web-scale machine learning tasks, so that labeling costs less.
Contribution
It introduces a novel data selection algorithm integrated into a framework for dataset management that minimizes labeling costs in industrial-scale applications.
Findings
Achieved good performance at reduced labeling cost by carefully selecting which instances to label.
Demonstrated effectiveness of the selection algorithm on web-scale problems.
Integrated human computation into dataset management for efficient training data generation.
Abstract
As the number of applications that use machine learning algorithms increases, the need for labeled data useful for training such algorithms intensifies. Getting labels typically involves employing humans to do the annotation, which directly translates to training and working costs. Crowdsourcing platforms have made labeling cheaper and faster, but they still involve significant costs, especially for the cases where the potential set of candidate data to be labeled is large. In this paper we describe a methodology and a prototype system aiming at addressing this challenge for Web-scale problems in an industrial setting. We discuss ideas on how to efficiently select the data to use for training of machine learning algorithms in an attempt to reduce cost. We show results achieving good performance with reduced cost by carefully selecting which instances to label. Our proposed algorithm…
Taxonomy
Topics: Machine Learning and Algorithms · Mobile Crowdsensing and Crowdsourcing · Machine Learning and Data Classification
