Recommending Training Set Sizes for Classification
Phillip Koshute, Jared Zook, Ian McCulloh

TL;DR
This paper provides practical recommendations for selecting training set sizes in classification tasks by analyzing 20 datasets, aiming to optimize model performance while reducing data collection costs.
Contribution
It introduces a systematic approach to determine sufficient training set sizes using confidence intervals and inverse power law learning curves, offering guidelines based on dataset characteristics.
Findings
Recommended training set sizes range from 3,000 to 30,000 data points, depending on a data set's number of classes and number of features.
The sufficient training set size (STSS) of a data set correlates with these characteristics.
The STSS of each data set is estimated from fitted learning curves using established convergence criteria.
Abstract
Based on a comprehensive study of 20 established data sets, we recommend training set sizes for any classification data set. We obtain our recommendations by systematically withholding training data and developing models through five different classification methods for each resulting training set. Based on these results, we construct accuracy confidence intervals for each training set size and fit the lower bounds to inverse power law learning curves. We also estimate a sufficient training set size (STSS) for each data set based on established convergence criteria. We compare STSS to the data sets' characteristics; based on identified trends, we recommend training set sizes between 3,000 and 30,000 data points, according to a data set's number of classes and number of features. Because obtaining and preparing training data has non-negligible costs that are proportional to data set size,…
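The curve-fitting step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the functional form acc(n) = a - b * n^(-c) is one common inverse power law parameterization and is assumed here, as are the normal-approximation confidence interval, the grid search over c, the function names, and the synthetic data.

```python
import math
from statistics import mean, stdev


def ci_lower_bound(accuracies, z=1.96):
    """Lower end of a normal-approximation confidence interval for mean accuracy.

    Assumption: the summary does not say how the intervals are built.
    """
    return mean(accuracies) - z * stdev(accuracies) / math.sqrt(len(accuracies))


def fit_inverse_power_law(sizes, lower_bounds):
    """Fit acc(n) = a - b * n**(-c) to (size, lower bound) pairs.

    For each candidate c on a grid, a and b follow from ordinary least
    squares on x = n**(-c); the c with the smallest squared error wins.
    """
    best = None
    for step in range(1, 201):  # candidate exponents c = 0.01 ... 2.00
        c = step / 100
        x = [n ** (-c) for n in sizes]
        x_bar, y_bar = mean(x), mean(lower_bounds)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, lower_bounds))
        slope = sxy / sxx
        a = y_bar - slope * x_bar  # asymptotic accuracy as n grows
        b = -slope                 # scale of the accuracy deficit
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, lower_bounds))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


# Synthetic, made-up lower bounds generated from known parameters,
# used only to show that the fit recovers them.
sizes = [100, 200, 500, 1000, 2000, 5000, 10000]
lower_bounds = [0.90 - 1.5 * n ** (-0.5) for n in sizes]

a, b, c = fit_inverse_power_law(sizes, lower_bounds)
print(f"a={a:.3f}, b={b:.3f}, c={c:.2f}")
```

In this parameterization, a is the accuracy the curve approaches as the training set grows, so a convergence criterion for STSS would compare the fitted curve against a. The specific criteria the authors use are not given in this summary and are left out of the sketch.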
Taxonomy
Topics: Machine Learning in Healthcare · Machine Learning and Data Classification · Topic Modeling
