Diffuse and Disperse: Image Generation with Representation Regularization
Runqian Wang, Kaiming He

TL;DR
This paper introduces Dispersive Loss, a simple regularizer for diffusion models that disperses internal representations. It improves generative quality without extra data or pre-training, and it bridges generative modeling and representation learning.
Contribution
It proposes Dispersive Loss, a minimalist, plug-and-play regularizer that enhances diffusion models by promoting representation dispersion without requiring positive pairs or external data.
Findings
Consistent improvements on ImageNet across various models
No need for pre-training or external data
Enhances diffusion models by regularizing internal representations
Abstract
The development of diffusion-based generative models over the past decade has largely proceeded independently of progress in representation learning. These diffusion models typically rely on regression-based objectives and generally lack explicit regularization. In this work, we propose Dispersive Loss, a simple plug-and-play regularizer that effectively improves diffusion-based generative models. Our loss function encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning, with the key distinction that it requires no positive sample pairs and therefore does not interfere with the sampling process used for regression. Compared to the recent method of representation alignment (REPA), our approach is self-contained and minimalist, requiring no pre-training, no additional parameters, and no external data. We evaluate…
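To make the idea of "dispersing" representations concrete, here is a minimal sketch in plain Python. The abstract does not state the exact form of the loss, so the InfoNCE-style log-mean-exp over negative squared distances used here is an assumption, as are the names `dispersive_loss`, `total_loss`, `tau` (temperature), and `weight` (loss weight). It illustrates a repulsion-only term with no positive pairs, added on top of an unchanged regression loss. It is not the authors' implementation.

```python
import math


def dispersive_loss(reps, tau=0.5):
    """Repulsion-only regularizer over a batch of representations (a sketch).

    Assumed form: log of the mean of exp(-||z_i - z_j||^2 / tau) over all
    pairs (i, j) in the batch. There is no positive-pair term. The value is
    0 when every representation is identical and decreases as they spread.
    """
    terms = []
    for z_i in reps:
        for z_j in reps:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
            terms.append(-sq_dist / tau)
    # Numerically stable log-mean-exp: subtract the maximum before exp.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms) / len(terms))


def total_loss(regression_loss, reps, weight=0.5, tau=0.5):
    """Regression objective plus the weighted regularizer (assumed combination)."""
    return regression_loss + weight * dispersive_loss(reps, tau)


# Hypothetical toy batches of 2-D intermediate representations.
collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
spread = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]

print(dispersive_loss(collapsed))  # collapsed batch: no reward from the regularizer
print(dispersive_loss(spread))     # spread batch: strictly lower (better) value
```

Because the term is computed only from the representations already produced in the forward pass, a sketch like this needs no extra parameters, encoder, or data, which matches the self-contained framing in the abstract.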
Peer Reviews
Decision: Submitted to ICLR 2026
- The idea of removing positive pairs while retaining the repulsive regularization aspect is conceptually appealing and practically justified by diffusion models’ intrinsic alignment objective.
- Comprehensive experiments across multiple architectures (DiT, SiT, MeanFlow) and scales (S/B/L/XL) show consistent improvements in FID and Inception Scores.
- The improvement trend scales with model size, indicating the loss acts as an effective regularizer for large-capacity models prone to overfitting
- The method is motivated intuitively but lacks a formal analysis of why dispersion improves generation quality. A deeper information-theoretic or geometric argument (e.g., on latent coverage or mutual information bounds) would strengthen the theoretical grounding.
- While FID and Inception Scores are strong indicators, evaluation on semantic diversity, perceptual similarity, or representation quality (e.g., CLIP-based metrics) could better reveal what aspects of representation regularization im
1. The proposed regularizer can be directly integrated into diffusion models with intermediate representations and requires little additional computational effort.
2. It elegantly incorporates concepts from self-supervised learning into diffusion model training in a straightforward and theoretically sound manner.
3. Experiments on a real-world image dataset demonstrate that adding the regularizer significantly improves performance, and the experimental results are comprehensive.
1. It would be helpful to clarify the scope of applicability. Can the proposed regularizer be applied to all diffusion models with intermediate representations?
2. Qualitative comparisons between images generated with and without the proposed regularizer would make the improvements more intuitive and visually convincing.
3. Although experiments explore different blocks, loss weights, and temperatures, it would be beneficial to provide systematic guidance or heuristics for selecting these hyper
1. The authors aim to address a fundamental problem, namely that generation should not be treated in isolation from representation learning, and offer insightful perspectives.
2. The paper features a clear structure and coherent logic.
3. The proposed method does not rely on a pretrained encoder.
1. The authors used limited evaluation metrics. As they don't rely on a pretrained encoder and claim the importance of representation learning in the generative task, the authors should evaluate how the method performs with metrics like linear probing.
2. It's not clear why the major improvements were made in the case without CFG, while the performance with CFG only achieves very limited improvements. The authors should provide deeper analyses to explain this discrepancy and not rely only on FI
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Diffusion
