Finding Non-Redundant Simpson's Paradox from Multidimensional Data
Yi Yang, Jian Pei, Jun Yang, and Jichun Xie

TL;DR
This paper introduces a novel framework for efficiently discovering non-redundant instances of Simpson's paradox in large multidimensional datasets, addressing redundancy issues that hinder previous methods.
Contribution
It formalizes types of redundancy in Simpson's paradox, proposes a concise representation framework, and develops algorithms that significantly improve detection efficiency and interpretability.
Findings
Redundant paradoxes can constitute over 40% of all detected paradoxes.
The proposed algorithms reduce runtime by up to 60% on large datasets.
The framework scales to millions of records and maintains robustness under data perturbation.
Abstract
Simpson's paradox, a long-standing statistical phenomenon, describes the reversal of an observed association when data are disaggregated into sub-populations. It has critical implications across statistics, epidemiology, economics, and causal inference. Existing methods for detecting Simpson's paradox overlook a key issue: many paradoxes are redundant, arising from equivalent selections of data subsets, identical partitioning of sub-populations, and correlated outcome variables, which obscure essential patterns and inflate computational cost. In this paper, we present the first framework for discovering non-redundant Simpson's paradoxes. We formalize three types of redundancy - sibling child, separator, and statistic equivalence - and show that redundancy forms an equivalence relation. Leveraging this insight, we propose a concise representation framework for systematically organizing…
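The reversal described in the abstract can be made concrete with a small check. The sketch below is not the paper's algorithm and does not model its redundancy types. It only tests one common reading of Simpson's paradox: the comparison between two groups has the same direction in every sub-population but flips in the aggregate. The function name `is_simpson_paradox`, the data layout, and the counts (taken from the widely cited kidney-stone textbook illustration, not from the paper) are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical layout: sub-population -> treatment -> (successes, trials).
Counts = Dict[str, Dict[str, Tuple[int, int]]]


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def is_simpson_paradox(groups: Counts, a: str, b: str) -> bool:
    """True if a-vs-b has one strict direction in every sub-population
    and the opposite strict direction once the sub-populations are pooled."""
    sub_signs = set()
    totals = {a: [0, 0], b: [0, 0]}
    for counts in groups.values():
        rate_a = counts[a][0] / counts[a][1]
        rate_b = counts[b][0] / counts[b][1]
        sub_signs.add(sign(rate_a - rate_b))
        for t in (a, b):
            totals[t][0] += counts[t][0]
            totals[t][1] += counts[t][1]
    agg_sign = sign(totals[a][0] / totals[a][1] - totals[b][0] / totals[b][1])
    # All sub-populations must agree (one nonzero sign), and the pooled
    # comparison must point the other way.
    if len(sub_signs) != 1 or 0 in sub_signs:
        return False
    return agg_sign == -next(iter(sub_signs))


# Illustrative counts from the classic kidney-stone example (assumption:
# used here only as a familiar demonstration, not as data from the paper).
KIDNEY: Counts = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

if __name__ == "__main__":
    print(is_simpson_paradox(KIDNEY, "A", "B"))
```

Treatment A has the higher success rate within both stone sizes, yet treatment B has the higher rate when the two sizes are pooled, so the check reports a paradox. In a multidimensional table, many choices of subset and partitioning attribute can produce the same underlying reversal, which is the redundancy the paper sets out to remove.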
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Advanced Causal Inference Techniques · Statistical Methods and Bayesian Inference
