Computational Social Scientist Beware: Simpson's Paradox in Behavioral Data
Kristina Lerman

TL;DR
This paper highlights how heterogeneity in behavioral data can cause Simpson's paradox, leading to misleading aggregate analysis, and proposes a simple test to detect its presence in social data studies.
Contribution
It demonstrates the impact of Simpson's paradox in behavioral data and introduces a straightforward method to identify its influence on analysis results.
Findings
Aggregate data can misrepresent underlying subgroup behaviors.
Simpson's paradox can lead to incorrect conclusions in social data analysis.
A simple test can detect the presence of Simpson's paradox in datasets.
Abstract
Observational data about human behavior is often heterogeneous, i.e., generated by subgroups within the population under study that vary in size and behavior. Heterogeneity predisposes analysis to Simpson's paradox, whereby the trends observed in data that has been aggregated over the entire population may be substantially different from those of the underlying subgroups. I illustrate Simpson's paradox with several examples coming from studies of online behavior and show that aggregate response leads to wrong conclusions about the underlying individual behavior. I then present a simple method to test whether Simpson's paradox is affecting results of analysis. The presence of Simpson's paradox in social data suggests that important behavioral differences exist within the population, and failure to take these differences into account can distort the studies' findings.
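The reversal described in the abstract can be illustrated with a small synthetic dataset. The sketch below is not the paper's test, which the abstract does not specify. It is a minimal check, under assumed names (`ols_slope`, `trend_reversal`, and the made-up `records`), that compares the sign of the least-squares slope fitted to the pooled data with the slopes fitted inside each subgroup.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def trend_reversal(records):
    """Compare the pooled trend with per-subgroup trends.

    records: list of (group, x, y) tuples (a hypothetical layout).
    Returns (aggregate_slope, {group: slope}, reversal_flag), where the
    flag is True when every subgroup slope has the opposite sign to the
    aggregate slope.
    """
    by_group = defaultdict(list)
    for group, x, y in records:
        by_group[group].append((x, y))

    aggregate = ols_slope([x for _, x, _ in records],
                          [y for _, _, y in records])
    per_group = {
        group: ols_slope([x for x, _ in pts], [y for _, y in pts])
        for group, pts in by_group.items()
    }
    reversal = all(slope * aggregate < 0 for slope in per_group.values())
    return aggregate, per_group, reversal


# Synthetic, illustrative data: within each subgroup y falls as x grows,
# but subgroup "B" sits at both higher x and higher y than subgroup "A".
records = [
    ("A", 1, 10), ("A", 2, 9), ("A", 3, 8), ("A", 4, 7),
    ("B", 6, 20), ("B", 7, 19), ("B", 8, 18), ("B", 9, 17),
]

aggregate_slope, subgroup_slopes, reversal = trend_reversal(records)
print("aggregate slope:", aggregate_slope)
print("subgroup slopes:", subgroup_slopes)
print("trend reverses on disaggregation:", reversal)
```

Here the pooled slope is positive while both subgroup slopes are negative, so analysis of the aggregate alone would suggest the opposite of what happens within each subgroup. A sign comparison like this is only a coarse diagnostic: it depends on how the subgroups are defined and says nothing about statistical significance.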
