Choosing good subsamples for regression modelling

Thomas Lumley; Tong Chen

arXiv:2203.10701 · stat.ME · March 22, 2022

Open Access

TL;DR

This paper addresses the problem of choosing good subsamples for regression modelling in large health databases, using two-phase sampling designs and influence functions to improve the precision of the fitted model.

Contribution

It introduces a framework that uses influence functions to design subsamples for two-phase regression models, covering adaptive multiwave designs and the integration of prior information.

Findings

1. Influence functions unify design and estimation in subsampling.

2. Adaptive multiwave designs improve efficiency.

3. The paper discusses the information gap between estimators.

Abstract

A common problem in health research is that we have a large database with many variables measured on a large number of individuals. We are interested in measuring additional variables on a subsample; these measurements may be newly available, or expensive, or simply not considered when the data were first collected. The intended use for the new measurements is to fit a regression model generalisable to the whole cohort (and to its source population). This is a two-phase sampling problem; it differs from some other two-phase sampling problems in the richness of the phase I data and in the goal of regression modelling. In particular, an important special case is measurement-error models, where a variable strongly correlated with the phase II measurements is available at phase I. We will explain how influence functions have been useful as a unifying concept for extending classical results…
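As an illustration of how influence functions can guide which individuals to subsample, here is a minimal Python sketch. It is not the authors' procedure: it assumes a linear model fitted by ordinary least squares, a hypothetical phase I proxy `x_proxy` standing in for the expensive phase II variable, strata formed by cutting the estimated influence values at their 10% and 90% quantiles, and classical Neyman allocation (sample size proportional to stratum size times within-stratum standard deviation). The function names `ols_influence` and `neyman_allocation` are made up for this example.

```python
import numpy as np


def ols_influence(X, y):
    """Return per-individual influence values for each OLS coefficient.

    Row i is (X'X / n)^{-1} x_i * residual_i, so each column sums to zero.
    """
    n = X.shape[0]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X / n)
    return (X * resid[:, None]) @ bread.T, beta


def neyman_allocation(values, strata, n_total):
    """Split n_total across strata in proportion to N_h * S_h (Neyman allocation)."""
    labels = np.unique(strata)
    weights = np.array(
        [(strata == h).sum() * values[strata == h].std(ddof=1) for h in labels]
    )
    return dict(zip(labels.tolist(), n_total * weights / weights.sum()))


# Simulated phase I data (assumption: the proxy is observed on everyone).
rng = np.random.default_rng(0)
N = 2000
x_proxy = rng.normal(size=N)
y = 1.0 + 0.5 * x_proxy + rng.normal(size=N)
X = np.column_stack([np.ones(N), x_proxy])

# Influence values for the slope, estimated from phase I information only.
infl, beta = ols_influence(X, y)
slope_infl = infl[:, 1]

# Three strata: low, middle, and high influence (cut points are an assumption).
strata = np.digitize(slope_infl, np.quantile(slope_infl, [0.1, 0.9]))

# Unrounded phase II sample sizes per stratum for a budget of 200 measurements.
alloc = neyman_allocation(slope_infl, strata, n_total=200)
print(alloc)
```

In this sketch the tail strata receive a larger share of the phase II sample than their share of the cohort, because the influence values vary more there. The allocations are left unrounded; a real design would round them and cap each at the stratum size.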

Taxonomy

Topics: Statistical Methods and Bayesian Inference · Advanced Causal Inference Techniques