
TL;DR
This paper develops statistical tests for whether two datasets share the same clustering structure, focusing on mixtures of multivariate normal distributions whose mean parameters are unknown nuisance parameters and whose ambient dimensions may grow with the sample size.
Contribution
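It determines the phase diagram of the testing problem over the entire range of signal-to-noise ratios when mean parameters are known, and proposes tests that adaptively achieve the detection boundary when they are unknown, provided the ambient dimensions grow at a sub-linear rate with the sample size.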
Findings
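Established the phase diagram for clustering equivalence testing when mean parameters are known, with lower bounds and tests that achieve them.
Designed tests that adaptively achieve the detection boundary when the nuisance mean parameters are unknown.
Showed that this adaptivity holds as long as the ambient dimensions grow at a sub-linear rate with the sample size.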
Abstract
In this paper, we test whether two datasets share a common clustering structure. As a leading example, we focus on comparing clustering structures in two independent random samples from two mixtures of multivariate normal distributions. Mean parameters of these normal distributions are treated as potentially unknown nuisance parameters and are allowed to differ. Assuming knowledge of mean parameters, we first determine the phase diagram of the testing problem over the entire range of signal-to-noise ratios by providing both lower bounds and tests that achieve them. When nuisance parameters are unknown, we propose tests that achieve the detection boundary adaptively as long as ambient dimensions of the datasets grow at a sub-linear rate with the sample size.
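To make the setup concrete, here is a minimal simulation sketch. It is not the paper's test. It assumes a simplified model that the abstract does not spell out: both datasets are observed on the same n units, each is a symmetric two-component Gaussian mixture with identity covariance, and the means `theta` and `eta` are known. The names `simulate_pair` and `mismatch_fraction`, and the `flip_frac` parameter, are hypothetical and introduced only for illustration.

```python
import numpy as np


def simulate_pair(n, theta, eta, flip_frac, rng):
    """Draw two datasets on the same n units (illustrative model, an assumption).

    X_i ~ N(z_i * theta, I) and Y_i ~ N(s_i * eta, I), with labels in {-1, +1}.
    flip_frac is the fraction of units whose label differs between datasets;
    flip_frac = 0 means both datasets share one clustering structure.
    """
    z = rng.choice([-1, 1], size=n)
    s = z.copy()
    n_flip = int(round(flip_frac * n))
    flip_idx = rng.choice(n, size=n_flip, replace=False)
    s[flip_idx] *= -1
    X = z[:, None] * theta[None, :] + rng.standard_normal((n, theta.size))
    Y = s[:, None] * eta[None, :] + rng.standard_normal((n, eta.size))
    return X, Y


def mismatch_fraction(X, Y, theta, eta):
    """Naive oracle statistic assuming known means (not the paper's test).

    Each unit is assigned to a cluster by the sign of its projection onto the
    known mean direction. The result is the fraction of units whose assigned
    labels disagree, taken up to a global label swap.
    """
    z_hat = np.sign(X @ theta)
    s_hat = np.sign(Y @ eta)
    m = np.mean(z_hat != s_hat)
    return min(m, 1.0 - m)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = np.full(4, 3.0)  # dataset 1 has ambient dimension 4
    eta = np.full(9, 2.0)    # dataset 2 has ambient dimension 9
    for frac in (0.0, 0.3):
        X, Y = simulate_pair(1000, theta, eta, frac, rng)
        print(frac, mismatch_fraction(X, Y, theta, eta))
```

With a strong signal, the mismatch fraction tracks the true fraction of differing labels, so thresholding it separates the two hypotheses. The regimes the abstract targets are harder: weak signals, where individual labels cannot be recovered reliably, and unknown means, where the projection directions must themselves be estimated.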
