Two-cluster test

Xinying Liu, Lianyu Hu, Mudi Jiang, Simeng Zhang, Jun Lou, and Zengyou He

arXiv:2507.08382 · cs.LG · July 15, 2025

TL;DR

This paper introduces a two-cluster test for deciding whether two data subsets come from the same cluster. It targets the inflated Type-I error rates that classic two-sample tests produce when the subsets were themselves generated by a clustering procedure.

Contribution

It presents a new significance test that derives an analytical p-value from the boundary points between two subsets. The test reduces Type-I errors and applies to interpretable and hierarchical clustering.

Findings

01. Significantly reduces Type-I error rate compared to classic tests
02. Effective in synthetic and real data experiments
03. Applicable in tree-based interpretable clustering

Abstract

Cluster analysis is a fundamental research issue in statistics and machine learning. In many modern clustering methods, we need to determine whether two subsets of samples come from the same cluster. Since these subsets are usually generated by certain clustering procedures, deploying classic two-sample tests in this context would yield extremely small p-values, leading to an inflated Type-I error rate. To overcome this bias, we formally introduce the two-cluster test problem and argue that it is a fundamentally different significance testing problem from the conventional two-sample test. We also present a new method based on the boundary points between two subsets to derive an analytical p-value for significance quantification. Experiments on both synthetic and real data sets show that the proposed test significantly reduces the Type-I error rate, in comparison…
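The bias the abstract describes can be illustrated with a small simulation. The sketch below is not the paper's boundary-point method, which is not specified on this page. It only shows, under simplifying assumptions, why a classic two-sample test over-rejects when the two subsets were produced by a data-driven split. The assumptions are: one-dimensional standard normal data from a single cluster, a median split as a stand-in for a clustering procedure, and a Welch statistic with a normal approximation for the p-value. All function names are hypothetical.

```python
import math
import random
import statistics


def z_test_pvalue(a, b):
    """Two-sided p-value for a difference in means (Welch statistic, normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def random_split(data, rng):
    """Split chosen independently of the values: the setting classic tests assume."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def median_split(data, rng):
    """Data-driven split (stand-in for a clustering step); rng is unused."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


def rejection_rate(split, trials=300, n=200, alpha=0.05, seed=0):
    """Fraction of trials in which the test rejects, although all data share one cluster."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # a single true cluster
        a, b = split(data, rng)
        if z_test_pvalue(a, b) < alpha:
            rejections += 1
    return rejections / trials


naive_rate = rejection_rate(random_split)
clustered_rate = rejection_rate(median_split)
print(f"Type-I error, random split:      {naive_rate:.2f}")
print(f"Type-I error, data-driven split: {clustered_rate:.2f}")
```

Because every sample comes from the same distribution, each rejection is a Type-I error. The random split should reject at roughly the nominal level, while the data-driven split rejects almost always, since the split itself forces the two subset means apart.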

Taxonomy

Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Anomaly Detection Techniques and Applications