Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective
Steven Euijong Whang, Yuji Roh, Hwanjun Song, Jae-Gil Lee

TL;DR
This paper surveys the challenges of data collection and quality in deep learning, emphasizing the importance of data-centric AI practices, data quality issues, and fairness considerations in modern machine learning workflows.
Contribution
It provides a comprehensive overview of data collection, quality, validation, cleaning, and fairness issues specific to deep learning, highlighting research directions and solutions.
Findings
Many real-world datasets are small, dirty, biased, or poisoned.
Robust training techniques can mitigate the impact of imperfect data.
Fairness and bias mitigation are critical and emerging topics in data management for AI.
Abstract
Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be rethought so that data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications. Data collection is important because there is less need for feature engineering in recent deep learning approaches, but…
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning
