Pre-validation Revisited
Jing Shang, Sourav Chatterjee, Trevor Hastie, Robert Tibshirani

TL;DR
This paper revisits pre-validation for prediction models with datasets of different feature dimensions, providing new analytical and bootstrap methods for inference, and demonstrating their effectiveness through simulations and real-world applications.
Contribution
It extends the pre-validation problem formulation by dropping the independence assumption between the two feature sets, and introduces both an analytical distribution for the test statistic and a bootstrap approach for inference.
Findings
Proposed an analytical distribution for the pre-validated predictor's test statistic.
Developed a bootstrap procedure for inference in pre-validation.
Validated methods through simulations and real data applications.
Abstract
Pre-validation is a way to build a prediction model with two datasets of significantly different feature dimensions. Previous work showed that the asymptotic distribution of the resulting test statistic for the pre-validated predictor deviates from a standard Normal, which leads to issues in hypothesis testing. In this paper, we revisit the pre-validation procedure and extend the problem formulation without any independence assumption on the two feature sets. We propose not only an analytical distribution of the test statistic for the pre-validated predictor under certain models, but also a generic bootstrap procedure to conduct inference. We show properties and benefits of pre-validation in prediction, inference and error estimation through simulations and applications, including analysis of a breast cancer study and a synthetic GWAS example.
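For readers unfamiliar with the procedure, the following is a minimal sketch of two-stage pre-validation as it is commonly described, not the authors' implementation. It assumes a high-dimensional feature matrix X, a low-dimensional covariate matrix Z, and a continuous response y. The first stage uses ridge regression for simplicity (an assumption; other learners such as the lasso could be substituted), and the function names, fold count, and penalty value are all hypothetical.

```python
import numpy as np

def prevalidated_predictor(X, y, n_folds=5, ridge_penalty=1.0, seed=0):
    """Out-of-fold fitted values of y from the high-dimensional features X.

    Each observation's prediction comes from a model trained without that
    observation's fold, so the prediction never sees its own response.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    y_pv = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        X_tr, y_tr = X[train], y[train]
        x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()
        Xc = X_tr - x_mean
        # Ridge in dual form: solves an n_train x n_train system,
        # which is cheap when the number of features far exceeds n.
        gram = Xc @ Xc.T + ridge_penalty * np.eye(len(train))
        beta = Xc.T @ np.linalg.solve(gram, y_tr - y_mean)
        y_pv[held_out] = y_mean + (X[held_out] - x_mean) @ beta
    return y_pv

def second_stage_ols(y_pv, Z, y):
    """OLS of y on [intercept, y_pv, Z]; returns coefficients and naive t-statistics."""
    D = np.column_stack([np.ones(len(y)), y_pv, Z])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    sigma2 = resid @ resid / (D.shape[0] - D.shape[1])
    cov = sigma2 * np.linalg.inv(D.T @ D)
    return coef, coef / np.sqrt(np.diag(cov))

if __name__ == "__main__":
    # Simulated data (illustrative only): 100 samples, 500 "genes", 2 clinical covariates.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 500))
    Z = rng.normal(size=(100, 2))
    y = X[:, :5].sum(axis=1) + Z @ np.array([1.0, -1.0]) + rng.normal(size=100)

    y_pv = prevalidated_predictor(X, y)
    coef, t_stats = second_stage_ols(y_pv, Z, y)
    print("coefficient on pre-validated predictor:", coef[1])
    print("naive t-statistic:", t_stats[1])
```

The t-statistic returned here is the naive one. As the abstract notes, its distribution for the pre-validated predictor deviates from a standard Normal, so it should not be compared against the usual reference distribution without a correction such as those the paper proposes.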
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Statistical Methods and Inference · Generative Adversarial Networks and Image Synthesis
