Evaluation of the number of clusters in a data set using $p$-values from Multiple Tests of Hypotheses

Soumita Modak

arXiv:2605.20806·stat.ME·May 21, 2026

TL;DR

This paper introduces a nonparametric, interpoint distance-based measure for determining the number of clusters in a data set. It combines $p$-values from multiple hypothesis tests and applies to data of arbitrary dimension.

Contribution

It presents a novel cluster accuracy index that estimates the number of groups through multiple hypothesis testing, avoiding unnecessary computations that other accuracy measures perform.

Findings

01

Demonstrates the index's efficiency through data studies.

02

Shows superiority over existing methods in accuracy.

03

Applies to data sets of arbitrary dimension.

Abstract

This paper proposes a novel, nonparametric, interpoint distance-based measure to investigate whether any groups exist in a given data set and, if so, how many there are in total. It is a cluster accuracy index applicable to data sets of arbitrary dimension, in association with any clustering algorithm that requires the number of groups to be specified a priori. We perform univariate, nonparametric, multiple statistical tests of hypotheses, carrying out as many dependent tests as the sample size using the interpoint distances. The resulting $p$-values are combined to reach a decision, which is taken in a step-wise process over the possible numbers of clusters. This reduces unnecessary computation compared with other accuracy measures from the literature. A data study establishes the proposed index's efficiency and superiority.
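The general idea of one test per observation followed by a combined decision can be illustrated with a minimal Python sketch. This is not the paper's index: the abstract does not name the test statistic or the combination rule, so the sketch assumes a one-sided rank-sum test (normal approximation) comparing each point's within-cluster distances against its between-cluster distances, and a Bonferroni-style minimum-$p$ combination, which remains valid under dependence. All function names (`rank_sum_pvalue`, `pointwise_pvalues`, `bonferroni_combined`) are hypothetical.

```python
import math


def rank_sum_pvalue(within, between):
    """One-sided rank-sum p-value (normal approximation, no tie correction).

    Small values support the alternative that the 'within' distances
    tend to be smaller than the 'between' distances.
    """
    n1, n2 = len(within), len(between)
    pooled = sorted([(v, 0) for v in within] + [(v, 1) for v in between])
    rank_sum_within = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        rank_sum_within += avg_rank * sum(1 for t in range(i, j) if pooled[t][1] == 0)
        i = j
    u = rank_sum_within - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z


def pointwise_pvalues(points, labels):
    """One p-value per observation, so as many tests as the sample size."""
    pvals = []
    for i, p in enumerate(points):
        within = [math.dist(p, q) for j, q in enumerate(points)
                  if j != i and labels[j] == labels[i]]
        between = [math.dist(p, q) for j, q in enumerate(points)
                   if labels[j] != labels[i]]
        if not within or not between:
            pvals.append(1.0)  # no comparison possible for this point
        else:
            pvals.append(rank_sum_pvalue(within, between))
    return pvals


def bonferroni_combined(pvals):
    """Assumed combination rule: Bonferroni-adjusted minimum p-value."""
    return min(1.0, len(pvals) * min(pvals))


# Toy data: two well-separated 4x4 grids in the plane.
grid = [(float(x), float(y)) for x in range(4) for y in range(4)]
points = grid + [(x + 20.0, y + 20.0) for x, y in grid]

good_labels = [0] * 16 + [1] * 16            # matches the two grids
bad_labels = [i % 2 for i in range(32)]      # ignores the structure

combined_good = bonferroni_combined(pointwise_pvalues(points, good_labels))
combined_bad = bonferroni_combined(pointwise_pvalues(points, bad_labels))
print(combined_good, combined_bad)
```

In a step-wise use, the labels would come from a clustering algorithm run with a candidate number of clusters, and the combined $p$-value would be recomputed for each candidate in turn. The stopping rule is left open here because the abstract does not specify it.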
