Evaluating and Crafting Datasets Effective for Deep Learning With Data Maps
Jay Bishnu, Andrew Gondoputro

TL;DR
This paper proposes a method for creating smaller, high-quality datasets for deep learning by selecting samples based on their difficulty level, maintaining model accuracy while reducing resource requirements.
Contribution
It introduces a novel dataset curation approach that focuses on sample difficulty to optimize training efficiency and effectiveness.
Findings
Smaller datasets curated by difficulty can match large dataset performance.
The method reduces training time and resource consumption.
The approach also improves dataset quality assessment for deep learning models.
Abstract
Rapid development in deep learning model construction has prompted an increased need for appropriate training data. The popularity of large datasets - sometimes known as "big data" - has diverted attention from assessing their quality. Training on large datasets often requires excessive system resources and an infeasible amount of time. Furthermore, the supervised machine learning process has yet to be fully automated: large datasets require more time for manually labeling samples. We propose a method of curating smaller datasets with comparable out-of-distribution model accuracy. After an initial training session, samples are classified by how difficult it is for a model to learn from them, and the smaller dataset is built from an appropriate distribution of those difficulty classes.
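As a rough illustration of the idea in the abstract, the sketch below builds a simple data map from an initial training run and draws a smaller subset from it. It is not the authors' implementation. It assumes the common training-dynamics definition of a data map (confidence is the mean probability given to the true label across epochs, variability is the standard deviation of that probability), and the function names, thresholds, and category mix are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev


def data_map(probs_per_epoch):
    """Return (confidence, variability) per sample.

    probs_per_epoch[e][i] is the probability the model assigned to the
    true label of sample i at the end of epoch e (assumed input format).
    """
    n_samples = len(probs_per_epoch[0])
    stats = []
    for i in range(n_samples):
        series = [epoch[i] for epoch in probs_per_epoch]
        stats.append((mean(series), pstdev(series)))
    return stats


def categorize(confidence, variability, var_cut=0.2, conf_cut=0.5):
    """Assign a difficulty class. Both cutoffs are hypothetical."""
    if variability >= var_cut:
        return "ambiguous"  # predictions fluctuate between epochs
    return "easy" if confidence >= conf_cut else "hard"


def curate(stats, budget, mix):
    """Pick about `budget` sample indices following the class shares in `mix`."""
    buckets = {"easy": [], "ambiguous": [], "hard": []}
    for idx, (conf, var) in enumerate(stats):
        buckets[categorize(conf, var)].append(idx)
    selected = []
    for name, share in mix.items():
        quota = round(budget * share)
        selected.extend(buckets[name][:quota])  # first-come selection, for simplicity
    return sorted(selected)


# Toy training dynamics: 3 epochs, 6 samples (made-up numbers).
probs = [
    [0.90, 0.20, 0.10, 0.95, 0.50, 0.05],
    [0.95, 0.80, 0.10, 0.90, 0.30, 0.10],
    [0.90, 0.30, 0.15, 0.97, 0.90, 0.05],
]
stats = data_map(probs)
labels = [categorize(c, v) for c, v in stats]
subset = curate(stats, budget=4, mix={"easy": 0.25, "ambiguous": 0.5, "hard": 0.25})
```

The mix passed to `curate` is where the "appropriate distribution" of difficulty classes would be set; which proportions preserve out-of-distribution accuracy is the empirical question the paper addresses, and the values here are placeholders.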
Taxonomy
Topics: Machine Learning and Data Classification · Time Series Analysis and Forecasting · Image Processing and 3D Reconstruction
