Listwise Deletion in High Dimensions
J. Sophia Wang, Peter M. Aronow

TL;DR
This paper analyzes the limitations of listwise deletion in high-dimensional data, showing that it often results in dropping all data rows as the number of variables grows, which can severely limit data utility.
Contribution
It provides theoretical insights into the behavior of listwise deletion in high dimensions and illustrates potential practical issues with real datasets.
Findings
For large samples, listwise deletion drops all rows with probability 1 when every variable has some idiosyncratic missingness and the number of variables grows superlogarithmically in the sample size.
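In two canonical datasets from comparative politics and international relations, missing data means listwise deletion can leave the researcher able to use only a few of the available variables.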
Theoretical results highlight limitations of common missing data handling in high-dimensional settings.
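The intuition behind the first finding can be sketched with a short bound. This is an illustrative simplification, not the paper's proof: it assumes every cell is missing independently, and that variable j is missing with probability p_j of at least some fixed p > 0. The symbols n (rows), k (variables), p_j, and p are notation introduced here for the sketch.

```latex
\begin{align*}
\Pr(\text{row } i \text{ is complete})
  &= \prod_{j=1}^{k} (1 - p_j) \;\le\; (1 - p)^{k}, \\
\Pr(\text{at least one row is complete})
  &\le n\,(1 - p)^{k}
   = \exp\!\bigl(\log n + k \log(1 - p)\bigr).
\end{align*}
```

Because log(1 - p) is negative, the exponent tends to minus infinity whenever k / log n grows without bound, so under these assumptions the chance that any row survives listwise deletion tends to zero.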
Abstract
We consider the properties of listwise deletion when both n and the number of variables grow large. We show that when (i) all data has some idiosyncratic missingness and (ii) the number of variables grows superlogarithmically in n, then, for large n, listwise deletion will drop all rows with probability 1. Using two canonical datasets from the study of comparative politics and international relations, we provide numerical illustration that these problems may emerge in real world settings. These results suggest, in practice, using listwise deletion may mean using few of the variables available to the researcher.
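A small simulation can make the abstract's claim concrete. The sketch below is not from the paper. It assumes each cell is missing independently with the same probability p, and the value p = 0.3, the growth rates, and the function names are all hypothetical choices made for illustration. It compares the number of rows that survive listwise deletion when the number of variables k grows like log n versus like (log n)^2.

```python
import math
import random


def expected_complete_rows(n, k, p):
    """Expected number of fully observed rows, n * (1 - p)**k,
    assuming independent missingness with rate p in every cell."""
    return n * (1.0 - p) ** k


def count_complete_rows(n, k, p, seed=0):
    """Simulate an n-by-k dataset and count the rows that listwise
    deletion would keep (rows with no missing cell)."""
    rng = random.Random(seed)
    complete = 0
    for _ in range(n):
        # A row survives only if none of its k cells is missing.
        if all(rng.random() >= p for _ in range(k)):
            complete += 1
    return complete


if __name__ == "__main__":
    p = 0.3  # illustrative per-cell missingness rate (an assumption)
    for n in (100, 1_000, 10_000):
        k_log = math.ceil(math.log(n))          # logarithmic growth in n
        k_super = math.ceil(math.log(n) ** 2)   # superlogarithmic growth in n
        for label, k in (("log n", k_log), ("(log n)^2", k_super)):
            kept = count_complete_rows(n, k, p)
            expected = expected_complete_rows(n, k, p)
            print(f"n={n:>6} k={k:>3} ({label}): kept={kept}, expected={expected:.3g}")
```

Under these assumptions the expected number of surviving rows rises with n when k grows logarithmically but shrinks toward zero when k grows like (log n)^2, which mirrors the superlogarithmic condition in the abstract.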
Taxonomy
Topics: Game Theory and Voting Systems · Survey Sampling and Estimation Techniques
