On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning

Magdalena Proszewska; Nikolay Malkin; N. Siddharth

arXiv:2506.00136·cs.LG·June 3, 2025


Open Access·1 Repo·3 Reviews

TL;DR

This paper introduces DMZ, a diffusion autoencoder variant that combines effective representations for downstream tasks with efficient generation, by connecting diffusion autoencoders and diffusion models that learn their noising process.

Contribution

The paper proposes a new model, DMZ, that integrates design choices from diffusion autoencoders and diffusion models to improve representation quality and generation efficiency.

Findings

01. DMZ achieves better downstream task performance.
02. DMZ requires fewer denoising steps for generation.
03. The model effectively combines representation learning with efficient generation.

Abstract

Diffusion autoencoders (DAs) are variants of diffusion generative models that use an input-dependent latent variable to capture representations alongside the diffusion process. These representations, to varying extents, can be used for tasks such as downstream classification, controllable generation, and interpolation. However, the generative performance of DAs relies heavily on how well the latent variables can be modelled and subsequently sampled from. Better generative modelling is also the primary goal of another class of diffusion models -- those that learn their forward (noising) process. While effective at adjusting the noise process in an input-dependent manner, they must satisfy additional constraints derived from the terminal conditions of the diffusion process. Here, we draw a connection between these two classes of models and show that certain design decisions (latent…
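The setup described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `encode`, `alpha_bar`, `noise_input`, and `sample_latent` are hypothetical names, the linear beta schedule and its constants are the common DDPM defaults rather than values from the paper, and the thresholding encoder is a toy stand-in for a learned network. The Bernoulli latent prior follows the description in the reviews below; the probability `p=0.5` is an assumption.

```python
import math
import random


def encode(x, latent_dim):
    """Toy stand-in for a learned encoder: maps input x to a binary latent z.

    Assumption: each latent bit is 1 when the mean of its chunk of x is positive.
    """
    chunk = max(1, len(x) // latent_dim)
    z = []
    for i in range(latent_dim):
        part = x[i * chunk:(i + 1) * chunk] or [0.0]
        z.append(1 if sum(part) / len(part) > 0 else 0)
    return z


def alpha_bar(t, num_steps, beta_min=1e-4, beta_max=0.02):
    """Product of (1 - beta_s) for s = 1..t under a linear beta schedule.

    Requires num_steps > 1. The schedule constants are assumed defaults.
    """
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_min + (beta_max - beta_min) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_input(x, t, num_steps, rng):
    """Standard DDPM forward noising: x_t = sqrt(a) * x + sqrt(1 - a) * eps."""
    a = alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ei for xi, ei in zip(x, eps)]
    return x_t, eps


def sample_latent(latent_dim, rng, p=0.5):
    """Draw z directly from independent Bernoulli(p) bits, with no auxiliary sampler."""
    return [1 if rng.random() < p else 0 for _ in range(latent_dim)]
```

During training, a denoiser would receive `(x_t, t, encode(x, latent_dim))` and be trained to predict `eps`; at generation time, `sample_latent` would replace the encoder output. Both steps are omitted here because the listing does not specify the network or the loss.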

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 5

Strengths

* The paper is clear and easy to follow. The comprehensive experiments convincingly isolate and evaluate the effects of each design choice.
* The DMZ framework is fairly general: it performs well in unconditional generation and representation learning, and it can be extended to handle multimodal tasks such as image-to-image translation.

Weaknesses

* The cross-attention conditioning design is already widely used in modern diffusion transformers [1-2]. The current validation relies on an older U-Net architecture, so this component does not constitute a significant contribution by itself.
* The choice of latent dimensionality $|z|$ appears ad hoc. For generation tasks it is guided by the label-space size (suggesting that relatively low dimensions yield better generation quality), whereas representation learning for downstream tasks benefits…

Reviewer 02·Rating 2·Confidence 3

Strengths

- The focus on diffusion autoencoders is timely and relevant, addressing the need for efficient generation and representation learning.
- The illustrations and explanations of DM/DA are clear.
- The benchmarking tasks and datasets are appropriate for evaluating the proposed method.

Weaknesses

- The motivation and contribution of DMZ are unclear.
- The algorithmic details of DMZ are insufficient.
- The performance of DMZ is underwhelming compared to existing methods.

Reviewer 03·Rating 2·Confidence 4

Strengths

- Unlike previous diffusion autoencoders, DMZ does not rely on an auxiliary latent sampler. By directly sampling $z$ from a Bernoulli distribution, the method enables computationally efficient sampling.
- The learned latent representation is shown to be effective even in a multi-modal framework, indicating its potential generality beyond standard generation tasks.
- The proposed DDPM-based approach demonstrates clear improvements in generation quality, particularly when using a small number of…

Weaknesses

- It is unclear how the latent variable can be sampled from a Bernoulli distribution without any prior regularization. In standard DA frameworks, auxiliary latent samplers (such as [1,2]) or additional regularization terms (such as [3]) are typically used to properly model the latent prior. Without such mechanisms, it is not evident how the encoder output would naturally follow a Bernoulli prior. This appears to be a critical limitation of the proposed method.
- The effect of conditioning z onl…

Code & Models

Repositories

exlab-research/dmz
PyTorch·Official


Taxonomy

Topics·Generative Adversarial Networks and Image Synthesis

Methods·Diffusion