Pre-validation Revisited

Jing Shang; Sourav Chatterjee; Trevor Hastie; Robert Tibshirani

arXiv:2505.14985 · stat.ME · May 23, 2025

Open Access

TL;DR

This paper revisits pre-validation for prediction models with datasets of different feature dimensions, providing new analytical and bootstrap methods for inference, and demonstrating their effectiveness through simulations and real-world applications.

Contribution

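It extends the pre-validation problem formulation so that no independence assumption between the two feature sets is required, and introduces both an analytical distribution for the test statistic (under certain models) and a generic bootstrap procedure for inference.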

Findings

01

Proposed an analytical distribution for the pre-validated predictor's test statistic.

02

Developed a bootstrap procedure for inference in pre-validation.

03

Validated methods through simulations and real data applications.

Abstract

Pre-validation is a way to build a prediction model from two datasets with significantly different feature dimensions. Previous work showed that the asymptotic distribution of the resulting test statistic for the pre-validated predictor deviates from a standard Normal, which leads to issues in hypothesis testing. In this paper, we revisit the pre-validation procedure and extend the problem formulation without any independence assumption on the two feature sets. We propose not only an analytical distribution of the test statistic for the pre-validated predictor under certain models, but also a generic bootstrap procedure to conduct inference. We show properties and benefits of pre-validation in prediction, inference, and error estimation through simulations and applications, including an analysis of a breast cancer study and a synthetic GWAS example.
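Illustrative sketch

The abstract does not spell out the procedure, so the following is only a minimal sketch of pre-validation as it is commonly described: the high-dimensional features produce an out-of-fold (pre-validated) prediction for each sample, and that single predictor is then entered into a second-stage regression alongside the low-dimensional features. The function names, the ridge first stage, the fold count, and the linear second stage are all assumptions made for illustration and are not taken from the paper. The last entry of the returned t-statistics corresponds to the pre-validated predictor, which is the statistic whose null distribution the abstract says deviates from a standard Normal.

```python
import numpy as np


def prevalidated_predictor(X, y, n_folds=5, ridge=1.0, seed=0):
    """Out-of-fold predictions of y from the high-dimensional features X.

    Hypothetical helper: ridge regression is an assumed first-stage learner.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Randomly partition the sample indices into n_folds disjoint folds.
    folds = np.array_split(rng.permutation(n), n_folds)
    y_pv = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        X_tr, y_tr = X[train], y[train]
        # Center using training data only, so held-out samples stay unseen.
        x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()
        Xc = X_tr - x_mean
        beta = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(p),
                               Xc.T @ (y_tr - y_mean))
        # Each sample is predicted by a model that never saw its response.
        y_pv[held_out] = y_mean + (X[held_out] - x_mean) @ beta
    return y_pv


def second_stage_t_stats(Z, y_pv, y):
    """OLS of y on [intercept, Z, y_pv]; returns one t-statistic per column.

    Hypothetical helper: Z holds the low-dimensional features.
    """
    n = len(y)
    D = np.column_stack([np.ones(n), Z, y_pv])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    sigma2 = resid @ resid / (n - D.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(D.T @ D)))
    return coef / se
```

Comparing the last t-statistic against standard Normal quantiles is exactly the step the abstract flags as problematic; the paper's analytical distribution and bootstrap procedure are aimed at that step and are not reproduced in this sketch.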

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Statistical Methods and Inference · Generative Adversarial Networks and Image Synthesis