A Primer on the Data Cleaning Pipeline
Rebecca C. Steorts

TL;DR
This paper reviews the emerging field of data cleaning pipelines, detailing the four stages involved in preparing diverse and rapidly updated data sources for analysis.
Contribution
It introduces technical terminology and summarizes common methods used in the data cleaning pipeline, an area gaining importance with expanding data sources.
Findings
Defines the four stages of data cleaning pipelines
Summarizes key methods for data integration and cleaning
Highlights the importance of standardized procedures
Abstract
The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, has grown rapidly over the past decade. With this expansion, the statistical and methodological questions around data integration, that is, merging multiple data sources, have also grown. Specifically, the science of the "data cleaning pipeline" contains four stages that allow an analyst to perform downstream tasks, predictive analyses, or statistical analyses on "cleaned data." This article provides a review of this emerging field, introducing technical terminology and commonly used methods.
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data
