The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models
Flavio Nicoletti, Chenxiao Ma, Enrico Ventura, Luca Saglietti, Stefano Sarao Mannelli

TL;DR
This paper develops a high-dimensional analytical framework to understand how class heterogeneity and sampling imbalance influence the learning and memorization dynamics of diffusion models. It shows that class variance sets the order in which classes are learned and that sampling imbalance can delay the learning of minority classes.
Contribution
It introduces a novel theoretical analysis of class-dependent learning in diffusion models, highlighting the impact of class variance and imbalance on training dynamics.
Findings
Class variance determines learning order, favoring higher-variance classes.
Sampling imbalance can reverse class learning order and delay minority class learning.
Empirical validation on Fashion-MNIST confirms the theoretical predictions.
Abstract
Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models, and potentially exacerbate disparities, remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order, consistently favoring…
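As a rough illustration of the setting the abstract describes, the sketch below draws a two-class Gaussian mixture with unequal class variances and unequal sampling proportions, passes it through random features, and computes the eigenvalues of the empirical (uncentered) feature covariance. It is a minimal numerical sketch, not the paper's analytical derivation: the function names, the tanh nonlinearity, the choice of class means, and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def sample_mixture(n, d, frac_minority, sigma_major, sigma_minor, rng):
    """Draw n points from a two-class Gaussian mixture in d dimensions.

    Assumption: class means are +mu and -mu with a fixed unit-norm mu,
    and each class has isotropic noise with its own standard deviation.
    """
    mu = np.ones(d) / np.sqrt(d)                           # hypothetical mean direction
    labels = (rng.random(n) < frac_minority).astype(int)   # 1 = minority class
    signs = np.where(labels == 1, -1.0, 1.0)
    sigmas = np.where(labels == 1, sigma_minor, sigma_major)
    x = signs[:, None] * mu + sigmas[:, None] * rng.standard_normal((n, d))
    return x, labels


def feature_covariance_spectrum(x, p, rng):
    """Eigenvalues (ascending) of the empirical uncentered feature covariance.

    Assumption: features are tanh(W x / sqrt(d)) with i.i.d. Gaussian W.
    """
    n, d = x.shape
    w = rng.standard_normal((p, d))
    phi = np.tanh(x @ w.T / np.sqrt(d))   # shape (n, p)
    cov = phi.T @ phi / n                 # shape (p, p), symmetric PSD
    return np.linalg.eigvalsh(cov)


rng = np.random.default_rng(0)
x, labels = sample_mixture(n=2000, d=50, frac_minority=0.2,
                           sigma_major=1.0, sigma_minor=0.5, rng=rng)
eigs = feature_covariance_spectrum(x, p=100, rng=rng)
print("minority fraction in sample:", labels.mean())
print("largest eigenvalues:", eigs[-5:])
```

Varying `frac_minority`, `sigma_major`, and `sigma_minor` and re-plotting `eigs` gives a quick, informal way to see how imbalance and class variance change the spectrum in this toy setup.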
