The More Data, the Better? Demystifying Deletion-Based Methods in Linear Regression with Missing Data
Tianchen Xu, Kun Chen, Gen Li

TL;DR
This paper compares complete-case (listwise deletion) and available-case (pairwise deletion) methods for linear regression with missing data. It shows that using more data does not always yield more efficient estimates: the missing-data pattern, the covariance structure, and the true regression coefficients all affect which method performs better.
Contribution
It provides a theoretical comparison of deletion-based methods, clarifies misconceptions about data usage, and offers simulation evidence on their asymptotic properties.
Findings
Available-case analysis does not always outperform complete-case analysis in efficiency.
Missing-data patterns, covariance structure, and the true regression coefficient values together determine which method is more efficient.
Both methods are asymptotically unbiased under certain conditions.
Abstract
We compare two deletion-based methods for dealing with the problem of missing observations in linear regression analysis. One is the complete-case analysis (CC, or listwise deletion) that discards all incomplete observations and only uses common samples for ordinary least-squares estimation. The other is the available-case analysis (AC, or pairwise deletion) that utilizes all available data to estimate the covariance matrices and applies these matrices to construct the normal equation. We show that the estimates from both methods are asymptotically unbiased and further compare their asymptotic variances in some typical situations. Surprisingly, using more data (i.e., AC) does not necessarily lead to better asymptotic efficiency in many scenarios. Missing patterns, covariance structure and true regression coefficient values all play a role in determining which is better. We further…
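To make the two estimators described in the abstract concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names (`complete_case_ols`, `pairwise_cov`, `available_case_ols`) are hypothetical, missing values are assumed to be encoded as NaN, and the available-case variant shown here centers each pair of variables with the means of the rows where both are observed, which is only one of several conventions and may differ from the paper's exact estimator.

```python
import numpy as np

def complete_case_ols(X, y):
    """CC (listwise deletion): keep rows with no missing value, then run OLS."""
    keep = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    design = np.column_stack([np.ones(keep.sum()), X[keep]])
    beta, *_ = np.linalg.lstsq(design, y[keep], rcond=None)
    return beta  # [intercept, slope_1, ..., slope_p]

def pairwise_cov(Z):
    """Covariance matrix whose (j, k) entry uses every row observing both j and k."""
    p = Z.shape[1]
    S = np.empty((p, p))
    for j in range(p):
        for k in range(j, p):
            both = ~np.isnan(Z[:, j]) & ~np.isnan(Z[:, k])
            zj, zk = Z[both, j], Z[both, k]
            # Assumption: center with the pairwise-observed means.
            S[j, k] = S[k, j] = np.mean((zj - zj.mean()) * (zk - zk.mean()))
    return S

def available_case_ols(X, y):
    """AC (pairwise deletion): solve the normal equation built from pairwise covariances."""
    p = X.shape[1]
    S = pairwise_cov(np.column_stack([X, y]))
    slopes = np.linalg.solve(S[:p, :p], S[:p, p])  # S_xx^{-1} S_xy
    # Assumption: intercept from all available observations of each variable.
    intercept = np.nanmean(y) - np.nanmean(X, axis=0) @ slopes
    return np.concatenate([[intercept], slopes])
```

With fully observed data the two functions return the same coefficients; they diverge only once some entries are missing, which is the setting in which the abstract compares their asymptotic variances. Note that a pairwise covariance matrix is not guaranteed to be positive definite, so the linear solve in `available_case_ols` can be ill-conditioned for some missing patterns.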
Taxonomy
Topics: Statistical Methods and Inference · Statistical Methods and Bayesian Inference · Bayesian Modeling and Causal Inference
