Calculating $p$-values and their significances with the Energy Test for large datasets
W. Barter, C. Burr, C. Parkes

TL;DR
This paper introduces a new scalable method for calculating p-values in the energy test, enabling efficient analysis of large datasets by scaling distributions from smaller samples.
Contribution
It proposes a novel approach to determine the null distribution of the energy test statistic for large samples by scaling from small sample distributions, improving computational efficiency.
Findings
In a simple test case, the null-hypothesis distribution of the test statistic is not sufficiently well described by the generalised extreme value function.
A new scaling method accurately estimates p-values for large datasets.
The method enhances the energy test's applicability to big data scenarios.
Abstract
The energy test method is a multi-dimensional test of whether two samples are consistent with arising from the same underlying population, through the calculation of a single test statistic (called the $T$-value). The method has recently been used in particle physics to search for differences between samples that arise from CP violation. The generalised extreme value function has previously been used to describe the distribution of $T$-values under the null hypothesis that the two samples are drawn from the same underlying population. We show that, in a simple test case, the distribution is not sufficiently well described by the generalised extreme value function. We present a new method, where the distribution of $T$-values under the null hypothesis when comparing two large samples can be found by scaling the distribution found when comparing small samples drawn from the same…
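To make the abstract's terms concrete, the following is a minimal sketch of how a $T$-value and its null-hypothesis distribution might be computed for two small samples. It is not the authors' implementation. It assumes the commonly used form of the energy test statistic with a Gaussian weighting function, and a permutation procedure to build the null distribution. The function names, the width parameter `delta`, and the number of permutations are illustrative choices, not values taken from the paper.

```python
import numpy as np


def energy_test_t(a, b, delta=0.5):
    """T-value for two samples of shape (n, d) and (m, d).

    Assumes a Gaussian weighting function psi(d) = exp(-d^2 / (2 delta^2));
    the choice of psi and of delta is an assumption of this sketch.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    def psi(x, y):
        # Squared Euclidean distance between every pair of points.
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * delta ** 2))

    # Within-sample terms: sum over ordered pairs i != j, halved.
    # The diagonal of psi(x, x) is exactly 1, so subtract n (or m).
    t_aa = (psi(a, a).sum() - n) / (2.0 * n * (n - 1))
    t_bb = (psi(b, b).sum() - m) / (2.0 * m * (m - 1))
    # Cross-sample term over all pairs.
    t_ab = psi(a, b).sum() / (n * m)
    return t_aa + t_bb - t_ab


def permutation_null(a, b, n_perm=200, delta=0.5, seed=0):
    """T-values under the null, from random relabelling of the pooled sample."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([a, b])
    n = len(a)
    t_null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(len(pooled))
        t_null[k] = energy_test_t(pooled[perm[:n]], pooled[perm[n:]], delta)
    return t_null


def p_value(t_obs, t_null):
    """Fraction of null T-values at least as large as the observed one."""
    return (1 + np.sum(t_null >= t_obs)) / (1 + len(t_null))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    same = rng.normal(size=(100, 2))
    shifted = rng.normal(loc=3.0, size=(100, 2))
    t_obs = energy_test_t(same, shifted)
    print(p_value(t_obs, permutation_null(same, shifted)))
```

The pairwise sums cost a number of operations that grows quadratically with the sample size, and each permutation repeats them. This is why building the null distribution directly becomes impractical for large samples.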
