Gaussian and Non-Gaussian Universality of Data Augmentation
Kevin Han Huang, Peter Orbanz, Morgane Austern

TL;DR
This paper studies theoretically how data augmentation affects statistical estimates. Its effect on uncertainty and regularization is not uniform: it depends on the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension.
Contribution
The paper provides universality results that quantify how data augmentation affects the variance and limiting distribution of estimates, and introduces an adaptation of the Lindeberg technique to block dependence.
Findings
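Data augmentation can increase, rather than decrease, the uncertainty of estimates such as the empirical prediction risk.
It can act as a regularizer, but fails to do so in certain high-dimensional problems.
It may shift the double-descent peak of an empirical risk.
Its effects depend on the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension.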
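The dependence on the data distribution can be seen in a toy simulation. The sketch below is not the paper's setup: it assumes a plain sample mean as the estimator, standard normal data, and two hypothetical augmentations (`sign_flip`, under which the data distribution is invariant, and `add_noise`, under which it is not). All function names and parameter values are illustrative choices. For the mean of n points with k augmentations each, the sign-flip version has variance 1/(nk), while the additive-noise version has variance 1/n + s^2/(nk) for noise scale s, which exceeds the unaugmented variance 1/n.

```python
import numpy as np

def augmented_mean_variance(augment, n=50, k=5, trials=4000, seed=0):
    """Monte Carlo variance of the sample mean, without and with augmentation.

    Returns (variance of the plain mean, variance of the augmented mean),
    both estimated over `trials` independent data sets of size n.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(trials, n))                 # standard normal data
    plain = x.mean(axis=1)                           # unaugmented estimate
    copies = np.repeat(x[:, :, None], k, axis=2)     # k copies of each point
    augmented = augment(rng, copies).mean(axis=(1, 2))
    return plain.var(), augmented.var()

def sign_flip(rng, x):
    # Random sign flips: N(0, 1) is invariant under this transformation.
    return x * rng.choice([-1.0, 1.0], size=x.shape)

def add_noise(rng, x):
    # Additive Gaussian noise (scale 2.0 is an arbitrary illustrative value).
    return x + rng.normal(scale=2.0, size=x.shape)

if __name__ == "__main__":
    plain_var, flip_var = augmented_mean_variance(sign_flip)
    _, noise_var = augmented_mean_variance(add_noise)
    print(f"plain: {plain_var:.4f}  sign flip: {flip_var:.4f}  noise: {noise_var:.4f}")
```

In this toy case the same estimator becomes less variable under one augmentation and more variable under the other, which is the qualitative behavior the findings describe; it says nothing about the high-dimensional regimes the paper analyzes.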
Abstract
We provide universality results that quantify how data augmentation affects the variance and limiting distribution of estimates through simple surrogates, and analyze several specific models in detail. The results confirm some observations made in machine learning practice, but also lead to unexpected findings: Data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. It can act as a regularizer, but fails to do so in certain high-dimensional problems, and it may shift the double-descent peak of an empirical risk. Overall, the analysis shows that several properties data augmentation has been attributed with are not either true or false, but rather depend on a combination of factors -- notably the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension. As…
Taxonomy
Topics: Statistical Methods and Inference · Bayesian Methods and Mixture Models · Financial Risk and Volatility Modeling
