A Proposal to Study "Is High Quality Data All We Need?"
Swaroop Mishra, Anjana Arunkumar

TL;DR
This paper investigates whether smaller, high-quality datasets can replace large datasets for training deep neural networks, challenging the assumption that more data is always better for model performance and robustness.
Contribution
It proposes an empirical study to evaluate the effectiveness of high-quality data selection and creation methods as alternatives to large-scale datasets.
Findings
Preliminary evidence suggests high-quality data can improve model robustness.
High-quality datasets may reduce training time and computational resources.
The work could shift the focus of deep learning from data quantity to data quality.
Abstract
Even though deep neural models have achieved superhuman performance on many popular benchmarks, they have failed to generalize to out-of-distribution (OOD) or adversarial datasets. Conventional approaches aimed at increasing robustness include developing increasingly large models and augmenting training with large-scale datasets. However, orthogonal to these trends, we hypothesize that a smaller, high-quality dataset is what we need. Our hypothesis is based on the fact that deep neural networks are data-driven models, and data is what leads or misleads models. In this work, we propose an empirical study that examines how to select a subset of existing data and/or create high-quality benchmark data from which a model can learn effectively. We seek to answer whether big datasets are truly needed to learn a task, and whether a smaller subset of high-quality data can replace big datasets. We plan to investigate both data pruning and data creation…
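As a rough illustration of what "data pruning" can mean in practice, the sketch below keeps only the highest-scoring fraction of examples within each label. It is not the authors' method: the abstract does not say how quality is measured, so the per-example `scores`, the function name `prune_by_score`, and the per-label balancing are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def prune_by_score(examples, scores, keep_fraction):
    """Keep the top `keep_fraction` of examples per label, ranked by score.

    `examples` is a list of (text, label) pairs and `scores` is a parallel
    list of hypothetical quality scores (higher means better). How such
    scores are obtained is left open here; that is an assumption.
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Group example indices by label so pruning does not wipe out a class.
    by_label = defaultdict(list)
    for idx, ((_, label), score) in enumerate(zip(examples, scores)):
        by_label[label].append((score, idx))

    kept = []
    for items in by_label.values():
        # Highest score first; ties are broken by original position.
        items.sort(key=lambda pair: (-pair[0], pair[1]))
        n_keep = max(1, math.ceil(len(items) * keep_fraction))
        kept.extend(idx for _, idx in items[:n_keep])

    # Return the survivors in their original order.
    return [examples[i] for i in sorted(kept)]


data = [("a", "pos"), ("b", "pos"), ("c", "neg"), ("d", "neg")]
quality = [0.9, 0.1, 0.2, 0.8]
subset = prune_by_score(data, quality, keep_fraction=0.5)
```

Comparing a model trained on `subset` with one trained on the full `data` is the kind of experiment the proposal describes; the scoring function is the part a real study would have to define.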
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Machine Learning and Data Classification
Methods: Pruning
