Choosing good subsamples for regression modelling
Thomas Lumley, Tong Chen

TL;DR
This paper addresses the challenge of selecting optimal subsamples for regression modelling in large health datasets, emphasising two-phase sampling strategies and influence functions to improve estimation accuracy.
Contribution
It introduces a framework using influence functions for designing subsamples in two-phase regression models, including adaptive multiwave designs and prior information integration.
Findings
Influence functions unify design and estimation in subsampling.
Adaptive multiwave designs improve efficiency.
The paper discusses the information gap between estimators.
Abstract
A common problem in health research is that we have a large database with many variables measured on a large number of individuals. We are interested in measuring additional variables on a subsample; these measurements may be newly available, or expensive, or simply not considered when the data were first collected. The intended use for the new measurements is to fit a regression model generalisable to the whole cohort (and to its source population). This is a two-phase sampling problem; it differs from some other two-phase sampling problems in the richness of the phase I data and in the goal of regression modelling. In particular, an important special case is measurement-error models, where a variable strongly correlated with the phase II measurements is available at phase I. We will explain how influence functions have been useful as a unifying concept for extending classical results…
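As a rough illustration of how influence functions can drive a phase II design, the sketch below applies classical Neyman allocation to estimated influence-function values within phase I strata. It assumes that Neyman allocation is among the "classical results" the abstract refers to; the source does not say so explicitly. The function name, the strata labels, and the influence values are all hypothetical, and in practice the values would come from a model fitted with a phase I surrogate in place of the expensive variable.

```python
import statistics


def neyman_allocation(influence_by_stratum, n_total, min_per_stratum=2):
    """Split n_total phase II units across strata with n_k proportional to N_k * sd_k.

    influence_by_stratum maps a stratum label to the (estimated) influence-function
    values of every phase I unit in that stratum. Hypothetical helper, not the
    authors' implementation.
    """
    # Weight each stratum by its size times the spread of its influence values.
    weights = {
        k: len(v) * statistics.pstdev(v) for k, v in influence_by_stratum.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all strata have zero influence-function variance")

    allocation = {}
    for k, values in influence_by_stratum.items():
        n_k = round(n_total * weights[k] / total)
        # Keep at least a few units per stratum, and never more than it contains.
        # After rounding and capping, the sizes may not sum exactly to n_total.
        allocation[k] = max(min_per_stratum, min(len(values), n_k))
    return allocation


# Made-up influence values: a large, homogeneous stratum and a small, variable one.
strata = {
    "low": [0.1, -0.1] * 50,   # N = 100, sd = 0.1
    "high": [2.0, -2.0] * 10,  # N = 20,  sd = 2.0
}
alloc = neyman_allocation(strata, n_total=10)
print(alloc)
```

The small but highly variable stratum receives most of the phase II sample, which is the usual behaviour of this kind of allocation: sampling effort goes where the influence functions vary most, not where the phase I units are most numerous.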
Taxonomy
Topics: Statistical Methods and Bayesian Inference · Advanced Causal Inference Techniques
