Clustering Via Crowdsourcing

Arya Mazumdar; Barna Saha

arXiv:1604.01839·cs.DS·April 8, 2016·19 cites

Open Access

TL;DR

This paper studies crowdsourcing algorithms for entity resolution that reduce query complexity by using side information and that tolerate noisy crowd responses. It gives theoretical bounds and parallelizable algorithms.

Contribution

It introduces new information-theoretic bounds and algorithms that minimize queries for clustering with noisy crowd answers and side information.

Findings

01

Query complexity reduced to linear or sublinear in n

02

Algorithms are near-optimal and parallelizable

03

Bounds closely match theoretical limits

Abstract

In recent years, crowdsourcing, also known as human-aided computation, has emerged as an effective platform for solving problems that are considered complex for machines alone. Using humans is time-consuming and costly due to monetary compensation. Therefore, a crowd-based algorithm must judiciously use any information computed through an automated process and adaptively ask the crowd a minimum number of questions. One such problem that has received significant attention is entity resolution. Formally, we are given a graph $G = (V, E)$ with unknown edge set $E$, where $G$ is a union of $k$ (again unknown, but typically large, $O(n^{\alpha})$ for $\alpha > 0$) disjoint cliques $G_i(V_i, E_i)$, $i = 1, \dots, k$. The goal is to retrieve the sets $V_i$ by making the minimum number of pairwise queries $V \times V \to \{\pm 1\}$ to an oracle (the crowd). When the answer to each query is correct, e.g. via…
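The query model in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a naive baseline that assumes every oracle answer is correct and uses no side information. The function name `cluster_with_oracle` and the toy `labels` dictionary are hypothetical, introduced only for illustration. Each new item is compared against one representative of every cluster found so far, so the baseline asks at most $nk$ queries.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def cluster_with_oracle(
    items: Sequence[Hashable],
    same_cluster: Callable[[Hashable, Hashable], bool],
) -> Tuple[List[List[Hashable]], int]:
    """Recover clusters with adaptive pairwise queries to a perfect oracle.

    Hypothetical baseline: assumes the oracle never errs (no noise).
    Returns the clusters and the number of queries asked.
    """
    clusters: List[List[Hashable]] = []
    queries = 0
    for v in items:
        for cluster in clusters:
            queries += 1
            # One query against the cluster's first member is enough,
            # because each cluster is a clique in the hidden graph.
            if same_cluster(v, cluster[0]):
                cluster.append(v)
                break
        else:
            # v matched no existing cluster, so it starts a new one.
            clusters.append([v])
    return clusters, queries


# Toy ground truth (an assumption for this example): item -> cluster id.
labels: Dict[str, int] = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 1}
clusters, queries = cluster_with_oracle(
    list(labels), lambda u, v: labels[u] == labels[v]
)
print(clusters, queries)  # [['a', 'c'], ['b', 'e'], ['d']] 6
```

On this toy input the baseline asks 6 queries, fewer than the 10 needed to query every pair. Noisy answers would break the single-representative shortcut, since one wrong answer would misplace an item.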

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Privacy-Preserving Technologies in Data · Data Quality and Management