The Lean Data Scientist: Recent Advances towards Overcoming the Data Bottleneck
Chen Shani, Jonathan Zarecki, Dafna Shahaf

TL;DR
This paper reviews recent methods addressing the data bottleneck in machine learning, proposing a taxonomy to organize existing techniques and inspire more efficient data collection and annotation strategies.
Contribution
The paper introduces a taxonomy of methods for overcoming data scarcity in ML, aiming to unify a literature scattered across different areas and make it easier for practitioners and researchers to navigate.
Findings
Provides a structured classification of data augmentation, synthesis, and transfer learning methods.
Highlights gaps and opportunities for future research in data-efficient ML.
Encourages community awareness and resource optimization in dataset creation.
Abstract
Machine learning (ML) is revolutionizing the world, affecting almost every field of science and industry. Recent algorithms (in particular, deep networks) are increasingly data-hungry, requiring large datasets for training. Thus, the dominant paradigm in ML today involves constructing large, task-specific datasets. However, obtaining quality datasets of such magnitude proves to be a difficult challenge. A variety of methods have been proposed to address this data bottleneck problem, but they are scattered across different areas, and it is hard for a practitioner to keep up with the latest developments. In this work, we propose a taxonomy of these methods. Our goal is twofold: (1) We wish to raise the community's awareness of the methods that already exist and encourage more efficient use of resources, and (2) we hope that such a taxonomy will contribute to our understanding of the…
