CEREAL: Few-Sample Clustering Evaluation

Nihal V. Nayak; Ethan R. Elenberg; Clemens Rosenbaum

arXiv:2210.00064·cs.LG·October 4, 2022

TL;DR

CEREAL is a comprehensive framework that improves the estimation of clustering quality with limited labels by combining active sampling, semi-supervised learning, and pseudo-labeling, reducing bias and annotation effort.

Contribution

It introduces novel NMI-based acquisition functions, integrates semi-supervised training of surrogate models, and extends evaluation to pairwise annotations, advancing few-sample clustering evaluation.

Findings

01. Reduces estimation error by up to 57% compared to baselines.

02. Effective across vision and language datasets.

03. Agnostic to clustering algorithms and metrics.

Abstract

Evaluating clustering quality with reliable evaluation metrics like normalized mutual information (NMI) requires labeled data that can be expensive to annotate. We focus on the underexplored problem of estimating clustering quality with limited labels. We adapt existing approaches from the few-sample model evaluation literature to actively sub-sample, with a learned surrogate model, the most informative data points for annotation to estimate the evaluation metric. However, we find that their estimation can be biased and only relies on the labeled data. To that end, we introduce CEREAL, a comprehensive framework for few-sample clustering evaluation that extends active sampling approaches in three key ways. First, we propose novel NMI-based acquisition functions that account for the distinctive properties of clustering and uncertainties from a learned surrogate model. Next, we use ideas…
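
To make the setting concrete, the following is a minimal sketch of the quantity being estimated and of the simplest label-limited estimator: NMI computed on a uniformly random labeled subsample. It is not the CEREAL method and does not use a surrogate model or an acquisition function. The function names (`nmi`, `random_subsample_nmi`), the `budget` and `seed` parameters, and the choice of arithmetic-mean normalization are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, cluster_ids):
    """NMI with arithmetic-mean normalization: 2 * I(Y; C) / (H(Y) + H(C)).

    The normalization is an assumption; other variants divide by the
    minimum, maximum, or geometric mean of the two entropies.
    """
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    count_y = Counter(true_labels)
    count_c = Counter(cluster_ids)
    mi = 0.0
    for (y, c), n_yc in joint.items():
        # p(y, c) * log( p(y, c) / (p(y) * p(c)) ), written with raw counts
        mi += (n_yc / n) * math.log(n_yc * n / (count_y[y] * count_c[c]))
    denom = entropy(true_labels) + entropy(cluster_ids)
    # Both partitions trivial (one class, one cluster): treat as perfect agreement.
    return 2.0 * mi / denom if denom > 0 else 1.0


def random_subsample_nmi(true_labels, cluster_ids, budget, seed=0):
    """Hypothetical baseline: annotate `budget` uniformly sampled points
    and compute NMI on those points only."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(true_labels)), budget)
    return nmi([true_labels[i] for i in idx], [cluster_ids[i] for i in idx])
```

Calling `random_subsample_nmi(labels, clusters, budget=50)` for several seeds and comparing against `nmi(labels, clusters)` on the full labeled set gives a rough sense of how much a small annotation budget can distort the estimate, which is the gap that few-sample evaluation methods aim to close.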

Taxonomy

Topics: COVID-19 diagnosis using AI · Advanced Clustering Algorithms Research · Domain Adaptation and Few-Shot Learning