Synthetic Data, Information, and Prior Knowledge: Why Synthetic Data Augmentation to Boost Sample Doesn't Work for Statistical Inference
Reid Dale, Jordan Rodu, Mike Baiocchi

TL;DR
This paper critically examines the limitations of synthetic data augmentation in statistical inference, highlighting its epistemic challenges, bounds on information gain, and the importance of justifiable priors for effective use.
Contribution
It offers a formal definition of synthetic distributions, analyzes their impact on inference, and discusses the epistemic constraints and necessary conditions for their valid application.
Findings
Synthetic data augmentation has fundamental informational bounds.
Naive prior specifications in synthetic augmentation are epistemically unjustifiable.
Augmentation can constrain model space without encoding complex constraints.
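The first finding can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' construction: it assumes a Gaussian mean-estimation problem with known variance, a synthetic generator that is simply the model fitted to the real sample, and a hypothetical helper named `simulate`. Under those assumptions, pooling synthetic draws with the real data should not reduce the error of the estimate, while the naive variance formula that treats all n + m points as real understates it.

```python
import numpy as np

def simulate(n=50, m=200, mu=0.0, sigma=1.0, trials=5000, seed=0):
    """Compare a real-data-only mean estimate with a pooled real+synthetic one.

    Illustrative assumptions: Gaussian data with known sigma, and a
    synthetic generator equal to the fitted model N(real_mean, sigma^2).
    """
    rng = np.random.default_rng(seed)

    # n real observations per trial, drawn from the true distribution.
    real = rng.normal(mu, sigma, size=(trials, n))
    real_mean = real.mean(axis=1)

    # m synthetic observations per trial, drawn from the fitted model.
    synth = rng.normal(real_mean[:, None], sigma, size=(trials, m))

    # Estimator that pools real and synthetic points as if all were real.
    pooled_mean = (real.sum(axis=1) + synth.sum(axis=1)) / (n + m)

    mse_real = np.mean((real_mean - mu) ** 2)      # about sigma^2 / n
    mse_pooled = np.mean((pooled_mean - mu) ** 2)  # about sigma^2/n + m*sigma^2/(n+m)^2
    naive_var = sigma ** 2 / (n + m)               # variance claimed if synthetic points counted as real
    return mse_real, mse_pooled, naive_var

mse_real, mse_pooled, naive_var = simulate()
print(f"real-only MSE: {mse_real:.4f}")
print(f"pooled MSE:    {mse_pooled:.4f}")
print(f"naive variance using n + m: {naive_var:.4f}")
```

In this toy setting the pooled estimator equals the real-data mean plus independent generator noise, so its mean squared error is at least that of the real-data mean. The synthetic draws add no information about the parameter beyond what the real sample already carried.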
Abstract
The use of synthetic data to deidentify data and to improve predictive models is well attested. The augmentation of datasets using synthetically generated data is an alluring proposition: in the best case, it generates realistic data in silico at a fraction of the cost of authentic data, which may be found in vivo or in vitro. This poses novel epistemic challenges. We contend that synthetic data augmentation is best understood as a novel way of accounting for prior knowledge. In this manuscript, we propose a definition of synthetic distributions and analyze how synthetic data augmentation interplays with standard accounts of maximum likelihood and Bayesian estimation. We observe that the marginal Fisher information contributed by synthetic data processes is subject to fundamental bounds, and enumerate obstacles to the use of synthetic data augmentation to…
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Advanced Causal Inference Techniques · Statistical Methods and Bayesian Inference
