A Survey on Data Collection for Machine Learning: a Big Data -- AI Integration Perspective
Yuji Roh, Geon Heo, Steven Euijong Whang

TL;DR
This survey reviews data collection for machine learning from a perspective that integrates data management and AI, covering techniques for large-scale data acquisition and labeling, open challenges, and future research directions.
Contribution
It provides a comprehensive overview of data collection methods from a data management perspective, highlighting the integration with AI and identifying key research challenges.
Findings
Data collection is a critical bottleneck in ML and AI.
Various techniques exist for data acquisition and labeling.
Integration of data management and AI opens new research opportunities.
Abstract
Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection…
Taxonomy
Topics: Data Stream Mining Techniques · Mobile Crowdsensing and Crowdsourcing · Machine Learning and Data Classification
