Why Empirical p-Values Are Not Uniform: Reference Samples, Dependence, and PIT Backtesting

Jakub Lis

arXiv:2605.16221 · stat.ME · May 18, 2026

TL;DR

This paper shows that empirical p-values and PITs estimated from finite reference samples are not uniformly distributed: common-sample and rolling-window implementations introduce dependence and variance distortions that undermine standard calibration assessments.

Contribution

It demonstrates that common implementations (a shared reference sample or a rolling window) change the statistical properties of empirical p-values, so classical one-sample uniformity tests cannot be applied to them as-is.

Findings

01

Empirical p-values estimated from finite reference samples are not uniformly distributed.

02

Dependence and variance distortions invalidate classical one-sample uniformity tests.

03

Backtesting procedures need adjustments to account for two-stage sampling.

Abstract

Probability integral transforms (PITs) and empirical $p$-values are widely used to assess the calibration of predictive distributions. While exact PIT values are uniformly distributed under correct model specification, practical implementations rely on empirical estimates constructed from finite samples. We show that this estimation step fundamentally alters the statistical structure of the problem. In particular, common-sample and rolling-window implementations introduce dependence and variance distortions that invalidate classical one-sample uniformity tests. When empirical percentiles are conditioned on a shared reference sample, the resulting statistics converge towards a two-sample Kolmogorov--Smirnov regime, while rolling windows induce autocorrelation and variance suppression. Our findings indicate that treating empirical percentiles as independent uniform draws can distort…
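The shared-reference-sample effect described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's code. The standard-normal data, the sample sizes, the trial count, the asymptotic 5% critical value `1.358 / sqrt(n)`, and the function names `ks_uniform_statistic` and `rejection_rate` are all assumptions made for illustration. The sketch applies a one-sample Kolmogorov-Smirnov test against Uniform(0, 1) twice under a correctly specified model: once to exact PIT values, and once to empirical percentiles computed against a single finite reference sample.

```python
import bisect
import math
import random


def ks_uniform_statistic(u):
    """One-sample KS distance between the sample u and Uniform(0, 1)."""
    s = sorted(u)
    n = len(s)
    d = 0.0
    for i, x in enumerate(s):
        # Compare the ECDF just after and just before each observation.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


def rejection_rate(n_test, n_ref, n_trials, use_reference, seed=0):
    """Fraction of trials in which the one-sample KS test rejects uniformity.

    The model is correctly specified: test and reference data are both
    standard normal (an illustrative assumption).
    """
    rng = random.Random(seed)
    # Asymptotic 5% critical value of the one-sample KS test (approximation).
    crit = 1.358 / math.sqrt(n_test)
    rejections = 0
    for _ in range(n_trials):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_test)]
        if use_reference:
            # Empirical percentile: share of one common reference sample <= y.
            ref = sorted(rng.gauss(0.0, 1.0) for _ in range(n_ref))
            u = [bisect.bisect_right(ref, v) / n_ref for v in y]
        else:
            # Exact PIT under the true standard-normal CDF.
            u = [0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in y]
        if ks_uniform_statistic(u) > crit:
            rejections += 1
    return rejections / n_trials


rate_exact = rejection_rate(200, 200, 400, use_reference=False)
rate_shared = rejection_rate(200, 200, 400, use_reference=True)
print(f"exact PIT rejection rate:        {rate_exact:.3f}")
print(f"shared-reference rejection rate: {rate_shared:.3f}")
```

With exact PITs the rejection rate should stay near the nominal 5%. With a shared reference sample of comparable size it should be noticeably higher, because the statistic then fluctuates on the two-sample scale $\sqrt{1/n + 1/m}$ rather than $\sqrt{1/n}$. Increasing `n_ref` relative to `n_test` should shrink the gap.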
