Computational Social Scientist Beware: Simpson's Paradox in Behavioral Data

Kristina Lerman

arXiv:1710.08615 · cs.SI · December 16, 2022

TL;DR

This paper highlights how heterogeneity in behavioral data can cause Simpson's paradox, leading to misleading aggregate analysis, and proposes a simple test to detect its presence in social data studies.

Contribution

It demonstrates the impact of Simpson's paradox in behavioral data and introduces a straightforward method to identify its influence on analysis results.

Findings

01. Aggregate data can misrepresent underlying subgroup behaviors.

02. Simpson's paradox can lead to incorrect conclusions in social data analysis.

03. A simple test can detect the presence of Simpson's paradox in datasets.

Abstract

Observational data about human behavior is often heterogeneous, i.e., generated by subgroups within the population under study that vary in size and behavior. Heterogeneity predisposes analysis to Simpson's paradox, whereby the trends observed in data that has been aggregated over the entire population may be substantially different from those of the underlying subgroups. I illustrate Simpson's paradox with several examples coming from studies of online behavior and show that aggregate response leads to wrong conclusions about the underlying individual behavior. I then present a simple method to test whether Simpson's paradox is affecting results of analysis. The presence of Simpson's paradox in social data suggests that important behavioral differences exist within the population, and failure to take these differences into account can distort the studies' findings.
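
To make the trend reversal described in the abstract concrete, here is a minimal sketch in Python. It is not the paper's test, which the abstract does not spell out. It only compares the least-squares slope of the pooled data with the slopes inside each subgroup. The data, the group names `A` and `B`, and the helpers `ols_slope` and `trend_reverses` are all hypothetical and chosen purely for illustration.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var = sum((x - x_bar) ** 2 for x in xs)
    return cov / var


# Hypothetical synthetic data: within each subgroup y falls as x grows,
# but group B sits at both higher x and higher y than group A.
groups = {
    "A": ([1, 2, 3, 4], [10, 9, 8, 7]),
    "B": ([6, 7, 8, 9], [20, 19, 18, 17]),
}


def trend_reverses(groups):
    """Return (aggregate slope, per-group slopes, reversal flag).

    The flag is True when every subgroup slope has the opposite sign
    to the slope fitted on the pooled data.
    """
    all_x = [x for xs, _ in groups.values() for x in xs]
    all_y = [y for _, ys in groups.values() for y in ys]
    aggregate = ols_slope(all_x, all_y)
    per_group = {name: ols_slope(xs, ys) for name, (xs, ys) in groups.items()}
    reversal = all(slope * aggregate < 0 for slope in per_group.values())
    return aggregate, per_group, reversal


aggregate, per_group, reversal = trend_reverses(groups)
print("aggregate slope:", aggregate)
print("subgroup slopes:", per_group)
print("trend reverses:", reversal)
```

On this toy data the pooled slope is positive while both subgroup slopes are negative, so an analysis of the aggregate alone would suggest the opposite of what happens inside each subgroup. Real behavioral data would also need a principled way to choose the subgroups and to account for noise, which this sketch does not attempt.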
