Separating populations with wide data: A spectral analysis
Avrim Blum, Amin Coja-Oghlan, Alan Frieze, Shuheng Zhou

TL;DR
This paper analyzes a spectral method for partitioning a small data sample drawn from a mixture of product distributions when individual features are of low average quality. The aim is to use as few features as possible while still partitioning the sample correctly.
Contribution
It analyzes a spectral technique that approximately optimizes the total data size, the product of the number of data points and the number of features, needed to partition the sample correctly when features are of low quality.
Findings
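The spectral technique approximately optimizes the total data size (data points times features) needed for a correct partition.
The required data size is characterized as a function of the average feature quality.
The setting is motivated by clustering individuals by population of origin from markers when the divergence between any two populations is small.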
Abstract
In this paper, we consider the problem of partitioning a small data sample drawn from a mixture of product distributions. We are interested in the case that individual features are of low average quality γ, and we want to use as few of them as possible to correctly partition the sample. We analyze a spectral technique that is able to approximately optimize the total data size (the product of the number of data points n and the number of features K) needed to correctly perform this partitioning as a function of 1/γ for K > n. Our goal is motivated by an application in clustering individuals according to their population of origin using markers, when the divergence between any two of the populations is small.
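To make the setting concrete, here is a minimal sketch of a generic spectral bipartition on synthetic "wide" data (many more features than data points). It is not the paper's algorithm or its analysis: the two-population Bernoulli model, the helper names `sample_mixture` and `spectral_bipartition`, and the parameters `n`, `K`, and `gap` are all assumptions chosen for illustration. The sketch centers the data matrix, takes its leading left singular vector, and splits the points by sign.

```python
import numpy as np


def sample_mixture(n, K, gap, rng):
    """Draw n points with K binary features from two product distributions.

    Hypothetical setup: each feature is Bernoulli, and the two populations'
    means differ by `gap` on every feature (a weak per-feature signal).
    """
    labels = rng.integers(0, 2, size=n)              # hidden population of each point
    base = rng.uniform(0.3, 0.7, size=K)             # shared per-feature baseline
    shift = rng.choice([-gap, gap], size=K)          # direction of the difference
    p = np.stack([base - shift / 2, base + shift / 2])  # shape (2, K), stays in (0, 1)
    X = (rng.random((n, K)) < p[labels]).astype(float)
    return X, labels


def spectral_bipartition(X):
    """Split the rows of X into two groups using the top left singular vector."""
    # Center each feature so the leading direction reflects differences
    # between groups rather than the overall mean.
    centered = X - X.mean(axis=0, keepdims=True)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    # One coordinate per data point; its sign gives the predicted side.
    return (u[:, 0] > 0).astype(int)


rng = np.random.default_rng(0)
X, labels = sample_mixture(n=100, K=2000, gap=0.2, rng=rng)
pred = spectral_bipartition(X)

# Cluster labels are only defined up to swapping, so score both assignments.
agreement = max(np.mean(pred == labels), np.mean(pred != labels))
print(f"fraction of points partitioned consistently: {agreement:.2f}")
```

In this toy model, shrinking `gap` (weaker features) generally has to be compensated by more features or more points before the split becomes reliable. That is the kind of trade-off between the number of data points and the number of features that the abstract refers to.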
