Experimental Design Issues in Big Data. The Question of Bias

Elena Pesce; Eva Riccomagno; Henry P. Wynn

arXiv:1712.06916 · stat.ME · November 21, 2018

TL;DR

This paper discusses the challenges of bias and confounding in big data collection, especially from passive sources like social media, and reviews solutions such as randomization to address these issues.

Contribution

It highlights the specific issues of bias in big data and evaluates methods like randomization to mitigate these problems in causal inference.

Findings

01 Bias and confounders can distort causal analysis in big data.

02 Randomization and other methods can help reduce bias.

03 Passive data collection poses unique challenges for causal studies.

Abstract

Data can be collected in scientific studies via a controlled experiment or passive observation. Big data is often collected in a passive way, e.g. from social media. In studies of causation, great efforts are made to guard against bias, hidden confounders, and feedback, any of which can destroy the identification of causation by corrupting or omitting counterfactuals (controls). Various solutions to these problems are discussed, including randomization.
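The contrast the abstract draws between passive observation and randomization can be illustrated with a small simulation. This is a minimal sketch and is not taken from the paper: the data-generating model, the function name `simulate`, and the parameters `true_effect` and `confounder_strength` are all assumptions chosen for illustration. A hidden confounder `u` influences both who receives the treatment and the outcome when data are collected passively, while under randomization the treatment is assigned by a coin flip that is independent of `u`.

```python
import random
import statistics


def simulate(n, randomized, true_effect=1.0, confounder_strength=2.0, seed=0):
    """Return the difference in mean outcome between treated and control units.

    Hypothetical model (an assumption, not the paper's):
        outcome = true_effect * treated + confounder_strength * u + noise
    where u is a hidden confounder that is never observed by the analyst.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # hidden confounder
        if randomized:
            # Controlled experiment: assignment ignores u entirely.
            t = rng.random() < 0.5
        else:
            # Passive collection: units with larger u self-select into treatment.
            t = (u + rng.gauss(0.0, 1.0)) > 0.0
        y = true_effect * t + confounder_strength * u + rng.gauss(0.0, 1.0)
        (treated if t else control).append(y)
    return statistics.fmean(treated) - statistics.fmean(control)


passive_estimate = simulate(20000, randomized=False)
randomized_estimate = simulate(20000, randomized=True)

print(f"true effect:         {1.0:.2f}")
print(f"passive estimate:    {passive_estimate:.2f}")
print(f"randomized estimate: {randomized_estimate:.2f}")
```

Under these assumptions the passive estimate overstates the effect, because the treated group also has systematically higher values of the confounder, whereas the randomized estimate lands close to the true effect. A larger sample does not remove the bias in the passive case; it only makes the biased estimate more precise.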
