Causal Data Integration
Brit Youngmann, Michael Cafarella, Babak Salimi, and Anna Zeng

TL;DR
This paper introduces the Causal Data Integration (CDI) problem: automatically mining unobserved attributes from external sources and building a corresponding causal DAG, with the goal of improving the accuracy of causal inference.
Contribution
It defines the CDI problem, identifies key challenges, and proposes a system architecture for automatic causal data integration from external sources.
Findings
Preliminary results indicate that solving CDI is feasible.
The proposed system architecture outlines the key components needed for CDI.
The paper identifies future research directions for improving causal inference.
Abstract
Causal inference is fundamental to empirical scientific discoveries in natural and social sciences; however, in the process of conducting causal inference, data management problems can lead to false discoveries. Two such problems are (i) not having all attributes required for analysis, and (ii) misidentifying which attributes are to be included in the analysis. Analysts often only have access to partial data, and they critically rely on (often unavailable or incomplete) domain knowledge to identify attributes to include for analysis, which is often given in the form of a causal DAG. We argue that data management techniques can surmount both of these challenges. In this work, we introduce the Causal Data Integration (CDI) problem, in which unobserved attributes are mined from external sources and a corresponding causal DAG is automatically built. We identify key challenges and research…
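Problem (i) in the abstract, a missing attribute, can be illustrated with a small simulation. The sketch below is not the authors' system. It assumes a hypothetical linear setting in which a confounder `z` lives only in an external table, and it compares an effect estimate made without `z` against one made after joining `z` back in. All names, coefficients, and the join key are assumptions chosen for illustration.

```python
import numpy as np

def ols_coef(y, *covariates):
    """Return the least-squares coefficient on the first covariate (with intercept)."""
    X = np.column_stack([np.ones_like(y), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(0)
n = 20_000
ids = np.arange(n)

# Hypothetical data-generating process: z confounds treatment t and outcome y.
z = rng.normal(size=n)
t = 0.8 * z + rng.normal(size=n)
y = 1.0 * t + 2.0 * z + rng.normal(size=n)  # true effect of t on y is 1.0

# The analyst's table holds (id, t, y); z sits in an external table
# with a different row order, keyed by the same id (an assumption).
perm = rng.permutation(n)
ext_ids, ext_z = ids[perm], z[perm]

# Estimate without the unobserved attribute: biased by confounding.
naive = ols_coef(y, t)

# "Integrate" the external attribute by joining on id, then adjust for it.
z_joined = ext_z[np.argsort(ext_ids)]
adjusted = ols_coef(y, t, z_joined)

print(f"naive estimate:    {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

Under these assumed coefficients the naive slope converges to about 1.98 rather than the true 1.0, while the adjusted estimate should land close to 1.0. Knowing that `z` is the attribute to adjust for is exactly the domain knowledge the abstract describes as usually encoded in a causal DAG; the simulation simply assumes it.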
Videos
Causal Data Integration · YouTube
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Geochemistry and Geologic Mapping · Data Quality and Management
