Two-cluster test
Xinying Liu, Lianyu Hu, Mudi Jiang, Simeng Zhang, Jun Lou, and Zengyou He

TL;DR
This paper introduces a two-cluster test for deciding whether two data subsets originate from the same cluster, addressing the inflated Type-I error rates that classic two-sample tests produce when the subsets were themselves generated by a clustering procedure.
Contribution
It presents a new significance test based on the boundary points between two subsets, which reduces Type-I errors and can be applied to interpretable and hierarchical clustering.
Findings
Significantly reduces Type-I error rate compared to classic tests
Effective in synthetic and real data experiments
Applicable in tree-based interpretable clustering
Abstract
Cluster analysis is a fundamental research issue in statistics and machine learning. In many modern clustering methods, we need to determine whether two subsets of samples come from the same cluster. Since these subsets are usually generated by certain clustering procedures, deploying classic two-sample tests in this context would yield extremely small p-values, leading to an inflated Type-I error rate. To overcome this bias, we formally introduce the two-cluster test issue and argue that it is a totally different significance testing issue from the conventional two-sample test. Meanwhile, we present a new method based on the boundary points between two subsets to derive an analytical p-value for the purpose of significance quantification. Experiments on both synthetic and real data sets show that the proposed test is able to significantly reduce the Type-I error rate, in comparison…
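The bias described in the abstract can be reproduced with a small simulation. The sketch below is not the paper's boundary-point test; it only illustrates why a classic two-sample test misbehaves on subsets produced by a clustering step. All names (`two_sample_p_value`, `rejection_rate`, `random_split`, `threshold_split`) are hypothetical, the median split is an assumed stand-in for a one-dimensional clustering procedure, and the p-value uses a normal approximation to a Welch-type statistic.

```python
import math
import random
import statistics


def two_sample_p_value(a, b):
    """Two-sided p-value of a Welch-type z-test (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def random_split(data, rng):
    """Split the sample into two halves without looking at the values."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def threshold_split(data, rng):
    """Split at the median, mimicking a clustering step on 1-D data (assumption)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


def rejection_rate(split, trials=200, n=200, alpha=0.05, seed=0):
    """Fraction of trials in which the test rejects, although every sample
    is drawn from a single Gaussian, so each rejection is a Type-I error."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        a, b = split(data, rng)
        if two_sample_p_value(a, b) < alpha:
            rejections += 1
    return rejections / trials


print("random split:   ", rejection_rate(random_split))
print("threshold split:", rejection_rate(threshold_split))
```

With the random split, the rejection rate should stay near the nominal level of 0.05. With the threshold split, the two halves are separated by construction, so the classic test rejects in essentially every trial even though only one cluster exists.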
Taxonomy
Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Anomaly Detection Techniques and Applications
