A Data Prism: Semi-Verified Learning in the Small-Alpha Regime
Michela Meister, Gregory Valiant

TL;DR
This paper studies the semi-verified learning model for large, noisy, crowdsourced data. It shows that most of the true variable values can be recovered efficiently from only a constant number of verified values, provided a sufficient (possibly small) fraction of evaluators is reliable.
Contribution
It provides a theoretical framework and an efficient algorithm for semi-verified learning in the small-alpha regime, extending understanding of data extraction from unreliable crowdsourced datasets.
Findings
Achieves accurate recovery, but requires a very large number of evaluators, exceeding n^r
Runs in time linear in the number of variables n (for fixed r and ε), given query access to the dataset of evaluations
Applicable to practical scenarios like extracting cohort preferences from large datasets
Abstract
We consider a model of unreliable or crowdsourced data where there is an underlying set of n binary variables, each evaluator contributes a (possibly unreliable or adversarial) estimate of the values of some subset of r of the variables, and the learner is given the true value of a constant number of variables. We show that, provided an α-fraction of the evaluators are "good" (either correct, or with independent noise rate p < 1/2), then the true values of a (1 − ε) fraction of the n underlying variables can be deduced as long as α > 1/(2 − 2p)^r. This setting can be viewed as an instance of the semi-verified learning model introduced in [CSV17], which explores the tradeoff between the number of items evaluated by each worker and the fraction of good evaluators. Our results require the number of evaluators to be extremely large, > n^r, although our algorithm…
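The data model described in the abstract can be made concrete with a small simulation. The sketch below is illustrative only and is not the paper's algorithm: the function name `make_evaluations`, the parameter names (`r` for the subset size, `alpha` for the fraction of good evaluators, `p` for their noise rate), and the choice of adversary (one that simply negates the truth) are all assumptions made for this example. A real adversary may behave arbitrarily.

```python
import random

def make_evaluations(truth, num_evaluators, r, alpha, p, seed=0):
    """Generate (subset, report, is_good) triples for the crowdsourcing model.

    truth          -- list of 0/1 values, one per underlying variable
    num_evaluators -- how many evaluators to simulate
    r              -- number of variables each evaluator reports on
    alpha          -- probability that an evaluator is "good"
    p              -- independent flip probability for good evaluators
    """
    rng = random.Random(seed)
    n = len(truth)
    evaluations = []
    for _ in range(num_evaluators):
        # Each evaluator reports on r distinct variables chosen at random.
        subset = rng.sample(range(n), r)
        # Good/bad status is drawn independently, so the good fraction
        # is alpha only in expectation.
        is_good = rng.random() < alpha
        if is_good:
            # Good evaluator: each bit is flipped independently with prob. p.
            report = [truth[i] ^ (rng.random() < p) for i in subset]
        else:
            # Assumed stand-in for an adversary: report the negated truth.
            report = [1 - truth[i] for i in subset]
        evaluations.append((subset, report, is_good))
    return evaluations

if __name__ == "__main__":
    rng = random.Random(42)
    truth = [rng.randint(0, 1) for _ in range(20)]
    evals = make_evaluations(truth, num_evaluators=1000, r=3, alpha=0.2, p=0.1)
    good = sum(1 for _, _, g in evals if g)
    print(f"{good} of {len(evals)} evaluators are good")
```

With `alpha` well below 1/2, a per-variable majority vote over this data sides with the adversary, which is why the model also gives the learner the true values of a few variables.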
Taxonomy
Topics: Machine Learning and Algorithms · Complexity and Algorithms in Graphs · Distributed Sensor Networks and Detection Algorithms
