Missing data and cluster graphs: cluster-level missingness vs variable-level missingness
Willow Scott, Eugenio Valdano, Charles Assaad

TL;DR
This paper explores how cluster-level missingness models can be used for data recovery and causal inference when detailed variable-level missingness information is unavailable.
Contribution
It introduces two classes of cluster-based missingness graphs, formalizes their compatibility with variable-level models, and provides graphical conditions for recoverability.
Findings
Cluster-level missingness information can, in some settings, suffice for data recovery without a fully specified variable-level model.
Graphical conditions are established for recovering joint distributions.
Conditions are identified for recovering macro causal effects.
Abstract
Missing data is pervasive in many scientific domains such as public health, environmental science, and the social sciences. Recoverability from missing data is typically studied using fully specified variable-level missingness models, even though in many applications only coarse structural information is available, for instance when variables are grouped into clusters because of limited knowledge or for interpretability. In this paper, we investigate recoverability from such abstract representations. We introduce two classes of cluster-based missingness graphs: the m-C-DMG, which retains variable-specific missingness indicators, and the cm-C-DMG, which aggregates missingness mechanisms at the cluster level. We formalize the notion of compatibility between these abstract graphs and underlying variable-level missingness models, and study how this abstraction affects the recoverability…
