IT-OSE: Exploring Optimal Sample Size for Industrial Data Augmentation
Mingchun Sun, Rongqiang Zhao, Zhennan Huang, Songyu Ding, and Jie Liu

TL;DR
This paper introduces IT-OSE, an information-theoretic method for accurately estimating the optimal sample size in industrial data augmentation, improving model performance and stability while reducing costs.
Contribution
The paper proposes a novel information-theoretic framework (IT-OSE) for estimating the optimal sample size in industrial data augmentation tasks, together with a metric (the ICD score) for evaluating that estimate.
Findings
Improves classification accuracy by 4.38% compared to empirical estimation
Reduces mean absolute percentage error (MAPE) in regression tasks by 18.80%
Decreases computational costs by 83.97% compared to exhaustive search
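For readers unfamiliar with the regression metric above, here is a minimal sketch of how MAPE is conventionally computed. The function names `mape` and `relative_reduction` and the sample values are illustrative assumptions, not taken from the paper, and this summary does not state whether the reported 18.80% is a relative or an absolute reduction; the helper below shows the relative reading only.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned in percent.

    Standard definition: 100/n * sum(|(y_i - yhat_i) / y_i|).
    Undefined when any true value is zero.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def relative_reduction(baseline, improved):
    """Relative reduction of a metric, in percent (hypothetical helper)."""
    return 100.0 * (baseline - improved) / baseline


# Illustrative values only; not data from the paper.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
score = mape(y_true, y_pred)
```

With these illustrative values the per-sample errors are 10%, 10%, and 0%, so the score is their mean.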
Abstract
In industrial scenarios, data augmentation is an effective approach to improve model performance. However, augmentation is not unidirectionally beneficial. There is no theoretical research or established estimation for the optimal sample size (OSS) in augmentation, nor is there an established metric to evaluate the accuracy of OSS or its deviation from the ground truth. To address these issues, we propose an information-theoretic optimal sample size estimation (IT-OSE) to provide reliable OSS estimation for industrial data augmentation. An interval coverage and deviation (ICD) score is proposed to evaluate the estimated OSS intuitively. The relationship between OSS and dominant factors is theoretically analyzed and formulated, thereby enhancing the interpretability. Experiments show that, compared to empirical estimation, the IT-OSE increases accuracy in classification tasks across…
Taxonomy
Topics: Time Series Analysis and Forecasting · Data Stream Mining Techniques · Machine Learning and Data Classification
