Memorisation, convergence and generalisation in generative models

Antoine Maillard; Sebastian Goldt

arXiv:2605.21402 · stat.ML · May 21, 2026

TL;DR

This paper analytically investigates how generative models transition from memorising training data to generalising, revealing that convergence and latent factor recovery are distinct objectives with different data requirements.

Contribution

The paper provides an exact analytical characterisation of the memorisation-generalisation transition in linear generative models and extends these findings to convolutional denoisers and real data.

Findings

01. Models memorise at low data load

02. Convergence occurs when sample size is linear in input dimension

03. Convergence is insensitive to latent factor recovery

Abstract

Generative neural networks learn how to produce highly realistic images from a large, but finite number of examples - or do they simply memorise their training set? To settle this question, Kadkhodaie, Guth, Simoncelli and Mallat (ICLR '24) trained diffusion models independently on disjoint subsets of a dataset and showed that they converge to nearly the same density when the number of training images is large enough. This result raises two basic questions: how much data do you need for convergence, and what does convergence capture about learning the data distribution? Here, we address these questions by providing an exact analytical characterisation of the transition from memorisation to generalisation in linear generative models. We find that these models memorise at small load, while convergence emerges continuously when the number of samples is linear in the input dimension.…
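The disjoint-subset protocol described in the abstract can be illustrated with a toy sketch. This is not the paper's model or analysis: it assumes Gaussian data with a hypothetical spiked covariance, uses the sample covariance as a stand-in for a linear generative model, and measures agreement between two independently fitted models with a relative Frobenius distance. The function name `split_half_distance`, the spike strength and the chosen loads are all illustrative assumptions.

```python
import numpy as np


def split_half_distance(n_per_half, d, rng):
    """Relative Frobenius distance between covariances fitted on two disjoint halves."""
    # Assumed ground truth (illustrative only): identity plus one planted direction.
    u = np.zeros(d)
    u[0] = 1.0
    cov_true = np.eye(d) + 4.0 * np.outer(u, u)

    # Draw 2 * n_per_half samples from N(0, cov_true) and split them in two.
    chol = np.linalg.cholesky(cov_true)
    x = rng.standard_normal((2 * n_per_half, d)) @ chol.T
    a, b = x[:n_per_half], x[n_per_half:]

    # "Train" one zero-mean Gaussian model per half via its sample covariance.
    cov_a = a.T @ a / n_per_half
    cov_b = b.T @ b / n_per_half

    # Compare the two fitted models, normalised by the scale of the truth.
    return np.linalg.norm(cov_a - cov_b) / np.linalg.norm(cov_true)


rng = np.random.default_rng(0)
d = 50
loads = [0.5, 2.0, 8.0, 32.0]  # load alpha = n / d, samples per half over dimension
distances = [split_half_distance(int(alpha * d), d, rng) for alpha in loads]

for alpha, dist in zip(loads, distances):
    print(f"alpha = n/d = {alpha:5.1f}   relative distance = {dist:.3f}")
```

In this toy setting the two fitted covariances disagree strongly when the load is below one, where each sample covariance is rank-deficient and supported on the span of its own training points, and they approach each other as the number of samples grows in proportion to the dimension. Agreement between the two halves says nothing by itself about whether the planted direction has been recovered, which would need a separate check.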
