Data Cleaning and Machine Learning: A Systematic Literature Review
Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed, Dmytro Humeniuk, Foutse Khomh

TL;DR
This systematic review analyzes research published from 2016 to 2022 on how data cleaning techniques are applied to improve machine learning (ML) models and how ML methods are used for data cleaning, highlighting promising techniques and future directions.
Contribution
The paper provides a comprehensive review of 101 studies on data cleaning for ML and ML for data cleaning, identifying key techniques and proposing future research directions.
Findings
Summarizes 101 papers on data cleaning for ML and ML for data cleaning.
Identifies promising data cleaning techniques for future development.
Provides 24 recommendations for future research in the field.
Abstract
Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning; hence creating a dual relationship between ML and data cleaning. To the best of our knowledge, there is no study that comprehensively reviews this relationship. Objective: This paper's objectives are twofold. First, it aims to summarize the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides future work recommendations. Method: We conduct a systematic literature review of the papers published between 2016 and 2022 inclusively. We identify different types of data cleaning…
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data
