Dirty Data in the Newsroom: Comparing Data Preparation in Journalism and Data Science
Stephen Kasica, Charles Berret, and Tamara Munzner

TL;DR
This paper examines the challenges of data preparation in journalism, compares them with data science workflows, and introduces a taxonomy that characterizes dirty data issues as discrepancies between mental models.
Contribution
It extends existing data science models to include journalism-specific data preparation activities and identifies key challenges faced by data journalists.
Findings
Synthesized 60 dirty data issues from 16 existing dirty data taxonomies and interviews with 36 professional data journalists.
Developed a novel taxonomy that characterizes these issues as discrepancies between mental models.
Highlighted four major challenges: diachronic, regional, fragmented, and disparate data sources.
Abstract
The work involved in gathering, wrangling, cleaning, and otherwise preparing data for analysis is often the most time consuming and tedious aspect of data work. Although many studies describe data preparation within the context of data science workflows, there has been little research on data preparation in data journalism. We address this gap with a hybrid form of thematic analysis that combines deductive codes derived from existing accounts of data science workflows and inductive codes arising from an interview study with 36 professional data journalists. We extend a previous model of data science work to incorporate detailed activities of data preparation. We synthesize 60 dirty data issues from 16 taxonomies on dirty data and our interview data, and we provide a novel taxonomy to characterize these dirty data issues as discrepancies between mental models. We also identify four…
