Diffuse and Disperse: Image Generation with Representation Regularization

Runqian Wang; Kaiming He

arXiv:2506.09027·cs.CV·July 25, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Dispersive Loss, a simple regularizer for diffusion models that disperses internal representations, improving generative quality without extra data or pre-training, bridging generative modeling and representation learning.

Contribution

It proposes Dispersive Loss, a minimalist, plug-and-play regularizer that enhances diffusion models by promoting representation dispersion without requiring positive pairs or external data.

Findings

01 Consistent improvements on ImageNet across various models

02 No need for pre-training or external data

03 Enhances diffusion models by regularizing internal representations

Abstract

The development of diffusion-based generative models over the past decade has largely proceeded independently of progress in representation learning. These diffusion models typically rely on regression-based objectives and generally lack explicit regularization. In this work, we propose Dispersive Loss, a simple plug-and-play regularizer that effectively improves diffusion-based generative models. Our loss function encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning, with the key distinction that it requires no positive sample pairs and therefore does not interfere with the sampling process used for regression. Compared to the recent method of representation alignment (REPA), our approach is self-contained and minimalist, requiring no pre-training, no additional parameters, and no external data. We evaluate…
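The abstract describes a repulsion-only regularizer added to the usual regression objective, but this summary does not give its exact form. The following is a minimal sketch of one plausible instance, assuming an InfoNCE-style term (the log of the mean of exp(-squared distance / temperature) over all pairs in a batch, self-pairs included). The function names `dispersive_loss` and `total_loss`, the squared L2 distance, and the default `tau` and `weight` values are illustrative assumptions, not the authors' implementation.

```python
import math


def dispersive_loss(reps, tau=0.5):
    """Repulsion-only regularizer over a batch of representations.

    Assumed form (not taken from the summary above):
        log( mean over all pairs (i, j) of exp(-||z_i - z_j||^2 / tau) )
    Self-pairs (i == j) are included, so the value is 0 when all
    representations coincide and becomes negative as they spread apart.
    No positive pairs are needed.
    """
    n = len(reps)
    total = 0.0
    for zi in reps:
        for zj in reps:
            sq_dist = sum((a - b) ** 2 for a, b in zip(zi, zj))
            total += math.exp(-sq_dist / tau)
    return math.log(total / (n * n))


def total_loss(regression_loss, reps, weight=0.5, tau=0.5):
    """Hypothetical combined objective: regression term plus weighted regularizer."""
    return regression_loss + weight * dispersive_loss(reps, tau)


# Collapsed representations are not rewarded; dispersed ones lower the loss.
collapsed = [[0.0, 0.0], [0.0, 0.0]]
dispersed = [[0.0, 0.0], [3.0, 4.0]]
print(dispersive_loss(collapsed))
print(dispersive_loss(dispersed))
```

In a real training loop, `reps` would be the flattened activations of a chosen intermediate block for each sample in the batch, and the computation would be vectorized in the training framework. The reviewers' comments on blocks, loss weights, and temperatures correspond to the choice of layer, `weight`, and `tau` in this sketch.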

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The idea of removing positive pairs while retaining the repulsive regularization aspect is conceptually appealing and practically justified by diffusion models’ intrinsic alignment objective.
- Comprehensive experiments across multiple architectures (DiT, SiT, MeanFlow) and scales (S/B/L/XL) show consistent improvements in FID and Inception Scores.
- The improvement trend scales with model size, indicating the loss acts as an effective regularizer for large-capacity models prone to overfitting

Weaknesses

- The method is motivated intuitively but lacks a formal analysis of why dispersion improves generation quality. A deeper information-theoretic or geometric argument (e.g., on latent coverage or mutual information bounds) would strengthen the theoretical grounding.
- While FID and Inception Scores are strong indicators, evaluation on semantic diversity, perceptual similarity, or representation quality (e.g., CLIP-based metrics) could better reveal what aspects of representation regularization im

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. The proposed regularizer can be directly integrated into diffusion models with intermediate representations and requires little additional computational effort.
2. It elegantly incorporates concepts from self-supervised learning into diffusion model training in a straightforward and theoretically sound manner.
3. Experiments on a real-world image dataset demonstrate that adding the regularizer significantly improves performance, and the experimental results are comprehensive.

Weaknesses

1. It would be helpful to clarify the scope of applicability. Can the proposed regularizer be applied to all diffusion models with intermediate representations?
2. Qualitative comparisons between images generated with and without the proposed regularizer would make the improvements more intuitive and visually convincing.
3. Although experiments explore different blocks, loss weights, and temperatures, it would be beneficial to provide systematic guidance or heuristics for selecting these hyper

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The authors aim to address a fundamental problem, namely that the generation task should not be left to stand alone with representation learning, and offer insightful perspectives.
2. The paper features a clear structure and coherent logic.
3. The proposed method does not rely on a pretrained encoder.

Weaknesses

1. The authors used limited evaluation metrics. As they don't rely on a pretrained encoder and claim the importance of representation learning in the generative task, the authors should evaluate how the method performs with metrics like linear probing.
2. It's not clear why the major improvements were made in the case without CFG, while the performance with CFG only achieves very limited improvements. The authors should provide deeper analyses to explain this discrepancy and not rely only on FI

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Diffusion