A Primer on the Data Cleaning Pipeline

Rebecca C. Steorts

arXiv:2307.13219·cs.DB·July 26, 2023

TL;DR

This paper reviews the emerging field of data cleaning pipelines, detailing the four stages involved in preparing diverse and rapidly updated data sources for analysis.

Contribution

It introduces technical terminology and summarizes common methods used in the data cleaning pipeline, an area gaining importance with expanding data sources.

Findings

01 Defines the four stages of data cleaning pipelines

02 Summarizes key methods for data integration and cleaning

03 Highlights the importance of standardized procedures

Abstract

The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, has grown rapidly over the past decade. With this expansion, the statistical and methodological questions around data integration, that is, the merging of multiple data sources, have also grown. Specifically, the science of the "data cleaning pipeline" contains four stages that allow an analyst to perform downstream tasks, predictive analyses, or statistical analyses on "cleaned data." This article provides a review of this emerging field, introducing technical terminology and commonly used methods.

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data