Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

Steven Euijong Whang; Yuji Roh; Hwanjun Song; Jae-Gil Lee

arXiv:2112.06409·cs.LG·December 27, 2022·40 cites

Open Access

TL;DR

This paper surveys the challenges of data collection and quality in deep learning, emphasizing the importance of data-centric AI practices, data quality issues, and fairness considerations in modern machine learning workflows.

Contribution

It provides a comprehensive overview of data collection, quality, validation, cleaning, and fairness issues specific to deep learning, highlighting research directions and solutions.

Findings

01 Many real-world datasets are small, dirty, biased, or poisoned.

02 Robust training techniques can mitigate the impact of imperfect data.

03 Fairness and bias mitigation are critical and emerging topics in data management for AI.

Abstract

Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here, software engineering needs to be rethought so that data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications. Data collection is important because there is less need for feature engineering for recent deep learning approaches, but…

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning