Clustering Unclustered Data: Unsupervised Binary Labeling of Two Datasets Having Different Class Balances
Marthinus Christoffel du Plessis, Masashi Sugiyama

TL;DR
This paper presents an unsupervised method for binary labeling that uses two unlabeled datasets with different class balances. It estimates the sign of the difference between their probability densities, so it does not depend on the data forming clusters that match the underlying classes.
Contribution
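The paper shows that unlabeled data can be labeled by estimating the sign of the density difference between two unlabeled datasets, and introduces a method that estimates this sign directly. The approach applies even when the data is not well clustered.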
Findings
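In the reported experiments, the method is compared against several clustering methods on toy problems and real-world datasets and is shown to be useful.
The sign of the density difference can be estimated directly, without first estimating either density.
The method requires two unlabeled datasets with different class balances and is demonstrated on real-world data.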
Abstract
We consider the unsupervised learning problem of assigning labels to unlabeled data. A naive approach is to use clustering methods, but this works well only when data is properly clustered and each cluster corresponds to an underlying class. In this paper, we first show that this unsupervised labeling problem in balanced binary cases can be solved if two unlabeled datasets having different class balances are available. More specifically, estimation of the sign of the difference between probability densities of two unlabeled datasets gives the solution. We then introduce a new method to directly estimate the sign of the density difference without density estimation. Finally, we demonstrate the usefulness of the proposed method against several clustering methods on various toy problems and real-world datasets.
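The link between the density difference and the labels can be illustrated with a small sketch. If the two unlabeled datasets are mixtures of the same two class densities with different class priors a and b, then p_a(x) - p_b(x) = (a - b)(p_pos(x) - p_neg(x)), so the sign of the difference agrees with the balanced-prior decision rule up to a global label flip. The Python code below is only an illustration under assumptions that are not from the paper: synthetic 1-D Gaussian classes, priors of 0.8 and 0.2, and a plug-in kernel density estimate standing in for the direct sign estimator the paper proposes. All function names and parameters are hypothetical.

```python
import numpy as np


def gaussian_kde_1d(samples, query, bandwidth):
    """Plug-in Gaussian kernel density estimate of `samples`, evaluated at `query`."""
    diffs = (query[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernel.mean(axis=1)


def sample_mixture(rng, n, prior_pos, mean=1.5):
    """Draw n points from prior_pos * N(+mean, 1) + (1 - prior_pos) * N(-mean, 1)."""
    labels = np.where(rng.random(n) < prior_pos, 1, -1)
    x = rng.normal(loc=mean * labels, scale=1.0)
    return x, labels


def label_by_density_difference(x, data_a, data_b, bandwidth=0.4):
    """Assign +1 where the estimated density of A exceeds that of B, else -1."""
    diff = gaussian_kde_1d(data_a, x, bandwidth) - gaussian_kde_1d(data_b, x, bandwidth)
    return np.where(diff >= 0, 1, -1)


rng = np.random.default_rng(0)

# Two unlabeled datasets with different class balances (assumed: 0.8 vs 0.2).
# The true labels are kept only to evaluate the sketch; the labeler never sees them.
x_a, y_a = sample_mixture(rng, 1000, prior_pos=0.8)
x_b, y_b = sample_mixture(rng, 1000, prior_pos=0.2)

x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])

pred = label_by_density_difference(x_all, x_a, x_b)

# Unsupervised labels are only identifiable up to a global flip.
agreement = np.mean(pred == y_all)
accuracy = max(agreement, 1.0 - agreement)
print(f"accuracy up to label flip: {accuracy:.3f}")
```

The two classes overlap heavily here, so a clustering method has no clear gap to exploit, yet the sign of the estimated difference still changes near the class boundary. The plug-in estimate is the simplest stand-in; the abstract's point is that the sign can be estimated directly, without this intermediate density estimation step.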
Taxonomy
Topics: Machine Learning and Data Classification · Automated Road and Building Extraction · Rough Sets and Fuzzy Logic
