Active clustering for labeling training data
Quentin Lutz, Élie de Panafieu, Alex Scott, Maya Stein

TL;DR
This paper introduces an active clustering approach that reduces human labeling effort by asking experts pairwise queries, and analyzes algorithms under different class models with the goal of minimizing the total number of queries.
Contribution
It proposes a novel active clustering framework that leverages pairwise queries, characterizes optimal algorithms under specific class models, and addresses error handling in human responses.
Findings
Algorithms that minimize query count in the uniform class model are characterized.
Proposed algorithms are conjectured to be optimal in the fixed distribution model.
Performance comparisons show the proposed methods outperform random querying approaches.
Abstract
Gathering training data is a key step of any supervised learning task, and it is both critical and expensive. Critical, because the quantity and quality of the training data have a high impact on the performance of the learned function. Expensive, because most practical cases rely on humans-in-the-loop to label the data. The process of determining the correct labels is much more expensive than comparing two items to see whether they belong to the same class. Thus motivated, we propose a setting for training data gathering where the human experts perform the comparatively cheap task of answering pairwise queries, and the computer groups the items into classes (which can be labeled cheaply at the very end of the process). Given the items, we consider two random models for the classes: one where the set partition they form is drawn uniformly, the other one where each item chooses its class…
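To make the setting concrete, here is a minimal Python sketch of clustering driven only by "same class?" answers. It is a simple greedy baseline written for illustration and is not the algorithm analyzed in the paper. The function name `cluster_by_pairwise_queries`, the oracle `same_class`, and the toy `hidden_labels` are all hypothetical, and the oracle is assumed to answer without errors.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def cluster_by_pairwise_queries(
    items: Sequence[Hashable],
    same_class: Callable[[Hashable, Hashable], bool],
) -> Tuple[List[List[Hashable]], int]:
    """Group items using only pairwise same-class queries.

    Greedy baseline (an assumption for illustration, not the paper's method):
    each new item is compared with one representative per known cluster
    until a match is found; if none matches, it starts a new cluster.
    Returns the clusters and the number of queries asked.
    """
    clusters: List[List[Hashable]] = []
    queries = 0
    for item in items:
        for cluster in clusters:
            queries += 1  # one question put to the (simulated) human expert
            if same_class(cluster[0], item):
                cluster.append(item)
                break
        else:
            # No existing cluster matched, so the item opens a new one.
            clusters.append([item])
    return clusters, queries


# Toy ground truth standing in for the human expert (hypothetical data).
hidden_labels = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 1}


def oracle(x: str, y: str) -> bool:
    return hidden_labels[x] == hidden_labels[y]


clusters, queries = cluster_by_pairwise_queries(list(hidden_labels), oracle)
print(clusters, queries)
```

On this toy input the sketch recovers three clusters with six queries, after which each cluster would need only a single label. The order in which items and clusters are examined affects the query count, which is the quantity the paper seeks to minimize.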
Taxonomy
Topics: Data Management and Algorithms · Graph Theory and Algorithms · Semantic Web and Ontologies
