Two-phase validation sampling via principal components to improve efficiency in multi-model estimation from error-prone biomedical databases

Sarah C. Lotspeich; Cole Manschot

arXiv:2512.02182·stat.ME·May 21, 2026

TL;DR

This paper introduces a two-phase sampling method based on principal components for efficiently validating error-prone data in biomedical databases, improving estimation efficiency across multiple models.

Contribution

The method extends extreme tail sampling by using principal components to balance multiple models, improving efficiency in validation studies with multiple exposures.
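As a rough illustration of the idea, the sketch below selects a Phase II validation subset from the two tails of the first principal component of several error-prone exposures. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `pc_extreme_tail_sample`, the standardization step, the use of only the first component, the even split between tails, and the simulated data are all choices made here for illustration.

```python
import numpy as np

def pc_extreme_tail_sample(X, n_validate):
    """Pick rows with the most extreme first-principal-component scores.

    Hypothetical helper (not from the paper): takes half of the
    validation budget from the lower tail and the rest from the upper tail.
    Returns the sorted selected row indices and all PC1 scores.
    """
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]
    if not 0 <= n_validate <= n_rows:
        raise ValueError("n_validate must be between 0 and the number of rows")

    # Standardize each error-prone exposure so no single scale dominates.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Rows of Vt are the principal directions of the standardized data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[0]

    # Sort by PC1 score and take both tails.
    order = np.argsort(scores)
    n_low = n_validate // 2
    n_high = n_validate - n_low
    chosen = np.concatenate([order[:n_low], order[n_rows - n_high:]])
    return np.sort(chosen), scores

# Simulated Phase I data: three correlated error-prone exposures (assumed).
rng = np.random.default_rng(0)
cov = [[1.0, 0.6, 0.3],
       [0.6, 1.0, 0.5],
       [0.3, 0.5, 1.0]]
X_star = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=1000)

idx, scores = pc_extreme_tail_sample(X_star, n_validate=100)
print(len(idx), "patients selected for Phase II validation")
```

Because the principal component combines all exposures, a patient who is extreme on the component tends to be informative for several exposure-specific models at once, which is the intuition behind balancing multiple analyses with one sampling rule.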

Findings

01 Strategy improves efficiency across multiple models in simulations.

02 Application to NHANES data demonstrates practical benefits.

03 Method remains effective with correlated or heterogeneous errors.

Abstract

Two-phase sampling offers a cost-effective way to validate error-prone covariate measurements in biomedical databases. Inexpensive or easy-to-obtain information is collected for the entire study in Phase I. Then, a subset of patients undergoes cost-intensive validation (e.g., expert chart review) to collect more accurate data in Phase II. When balancing primary and secondary analyses, competing models and priorities can result in poorly defined objectives for the most informative Phase II sampling criterion. Extreme tail sampling (ETS), wherein patients with the smallest and largest values of a particular quantity (like a covariate or residual) are selected, can offer great statistical efficiency in two-phase studies when focusing on a single analytic objective by targeting observations with the biggest contributions to the Fisher information. We propose an intuitive, easy-to-use…
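The link between extreme values and Fisher information can be seen in a textbook special case. The derivation below assumes a simple linear regression with a single fully observed covariate and known error variance; it is offered only as intuition and is not the paper's own derivation.

```latex
% Assumed model: Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,
% with \varepsilon_i \sim N(0, \sigma^2) and \sigma^2 known.
\[
  \mathcal{I}_i(\beta_0, \beta_1)
  = \frac{1}{\sigma^2}
    \begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix},
  \qquad
  \operatorname{Var}\bigl(\hat{\beta}_1\bigr)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\]
```

Each observation adds $(x_i - \bar{x})^2 / \sigma^2$ to the information about the slope, so for a fixed sample size the variance of $\hat{\beta}_1$ is smallest when the selected $x_i$ lie far from their mean, that is, in the two tails.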

Taxonomy

Topics: Statistical Methods and Bayesian Inference · Advanced Causal Inference Techniques · Statistical Methods and Inference