A Theory for Conditional Generative Modeling on Multiple Data Sources
Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu

TL;DR
This paper provides a theoretical framework for understanding how multi-source data improves conditional generative models, showing that shared similarities among sources lead to sharper estimation bounds and better performance.
Contribution
The paper takes a first step toward a rigorous analysis of multi-source training in conditional generative modeling: it establishes distribution estimation error bounds and characterizes when similarity among sources helps.
Findings
Multi-source training attains a sharper error bound than single-source training when the source distributions share similarities and the model is expressive enough.
The bounds depend on the number of sources and on how similar their distributions are.
Experiments validate the theoretical advantages of multi-source over single-source training.
Abstract
The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes the first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation based on the bracketing number. Our result shows that when source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and deep generative models including autoregressive…
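To make the comparison in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's construction, bound, or experiment. It assumes each source is a 1-D Gaussian with its own mean and a standard deviation shared by all sources (the "similarity"). The single-source estimator fits a mean and a standard deviation per source. The multi-source estimator fits per-source means but pools all sources to fit one shared standard deviation. Both are maximum likelihood estimators under their respective models, and the error is the total variation distance averaged over sources. The names `tv_gaussians` and `compare`, and all parameter values, are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def tv_gaussians(mu1, s1, mu2, s2, steps=1000):
    """Total variation distance between two 1-D Gaussians.

    Computed as 0.5 * integral |p - q| with the trapezoid rule on a grid
    that covers 8 standard deviations around both means.
    """
    lo = min(mu1 - 8.0 * s1, mu2 - 8.0 * s2)
    hi = max(mu1 + 8.0 * s1, mu2 + 8.0 * s2)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * abs(gaussian_pdf(x, mu1, s1) - gaussian_pdf(x, mu2, s2))
    return 0.5 * h * total


def compare(seed=0, trials=40, num_sources=8, n=20, sigma=1.0):
    """Return (single-source, multi-source) average TV error.

    Toy assumption: every source shares the same true sigma, while the
    means differ from source to source.
    """
    rng = random.Random(seed)
    single_sum = 0.0
    multi_sum = 0.0
    for _ in range(trials):
        means = [rng.uniform(-3.0, 3.0) for _ in range(num_sources)]
        data = [[rng.gauss(m, sigma) for _ in range(n)] for m in means]
        mean_hats = [sum(xs) / n for xs in data]
        # Per-source sums of squared deviations from the fitted mean.
        sq = [sum((x - mh) ** 2 for x in xs) for xs, mh in zip(data, mean_hats)]
        # Multi-source MLE of the shared sigma pools all sources.
        pooled_sigma = math.sqrt(sum(sq) / (n * num_sources))
        for m, mh, s in zip(means, mean_hats, sq):
            # Single-source MLE of sigma uses only that source's n samples.
            single_sum += tv_gaussians(m, sigma, mh, math.sqrt(s / n))
            multi_sum += tv_gaussians(m, sigma, mh, pooled_sigma)
    count = trials * num_sources
    return single_sum / count, multi_sum / count


if __name__ == "__main__":
    single_avg, multi_avg = compare()
    print("average TV, single-source:", round(single_avg, 4))
    print("average TV, multi-source: ", round(multi_avg, 4))
```

In this toy setting the pooled estimate of the shared parameter draws on every source's samples, so the multi-source average error should come out smaller. If the sources did not actually share a standard deviation, pooling would introduce bias instead, which is why the similarity assumption matters.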
Taxonomy
Topics: Advanced Database Systems and Queries
