Diffusion Bridge AutoEncoders for Unsupervised Representation Learning
Yeongmin Kim, Kwanghyeon Lee, Minsang Park, Byeonghu Na, Il-Chul Moon

TL;DR
This paper introduces Diffusion Bridge AutoEncoders (DBAE), a novel diffusion-based model that improves unsupervised representation learning by enabling sample-dependent inference, full information retention in latent space, and high-quality sample generation.
Contribution
The paper proposes DBAE, a new diffusion autoencoder architecture that addresses the information split problem and allows z-dependent endpoint inference, enhancing representation learning and sample quality.
Findings
Enhances downstream inference, reconstruction, and disentanglement.
Generates high-fidelity unconditional samples.
Enables z-dependent endpoint inference for better representations.
Abstract
Diffusion-based representation learning has attracted substantial attention due to its promising capabilities in latent representation and sample generation. Recent studies have employed an auxiliary encoder to identify a corresponding representation from a sample and to adjust the dimensionality of a latent variable z. However, this auxiliary structure introduces an information split problem, because the diffusion model and the auxiliary encoder divide the information from the sample into two representations, one for each model. In particular, the information modeled by the diffusion becomes over-regularized because of the static prior distribution on xT. To address this problem, we introduce Diffusion Bridge AutoEncoders (DBAE), which enable z-dependent endpoint xT inference through a feed-forward architecture. This structure creates an information bottleneck at z, so xT becomes dependent on z in…
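The idea of a z-dependent endpoint can be illustrated with a small sketch. The code below is not the authors' implementation: the linear `encode` and `decode_endpoint` maps, the weight matrices, and the dimensions are all hypothetical stand-ins for learned networks, and the bridge is a plain Brownian bridge with unit diffusion rather than the specific forward SDE used in the paper. It only shows how a sample x0 might pass through a low-dimensional z to produce an endpoint xT, and how an intermediate state pinned at both ends could be drawn.

```python
import numpy as np

def encode(x0, W_enc):
    """Hypothetical linear encoder: sample x0 -> low-dimensional latent z."""
    return x0 @ W_enc

def decode_endpoint(z, W_dec):
    """Hypothetical linear decoder: latent z -> bridge endpoint x_T."""
    return z @ W_dec

def bridge_sample(x0, xT, t, T, rng):
    """Draw x_t from a Brownian bridge pinned at x0 (time 0) and xT (time T).

    Assumes unit diffusion, so the marginal at time t is Gaussian with
    mean (1 - t/T) * x0 + (t/T) * xT and variance t * (T - t) / T.
    """
    mean = (1.0 - t / T) * x0 + (t / T) * xT
    std = np.sqrt(t * (T - t) / T)
    return mean + std * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
data_dim, latent_dim, T = 8, 2, 1.0  # illustrative sizes only

# Random weights stand in for trained parameters.
W_enc = rng.standard_normal((data_dim, latent_dim))
W_dec = rng.standard_normal((latent_dim, data_dim))

x0 = rng.standard_normal((4, data_dim))   # a batch of 4 toy "samples"
z = encode(x0, W_enc)                     # bottleneck: all information passes through z
xT = decode_endpoint(z, W_dec)            # endpoint depends on x0 only via z
x_mid = bridge_sample(x0, xT, 0.5, T, rng)
```

Because xT is computed from z alone, anything the endpoint carries about x0 must pass through the latent, which is the bottleneck the abstract describes. At t = 0 and t = T the bridge noise vanishes and the sketch returns x0 and xT exactly.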
Peer Reviews
Decision: ICLR 2025 Spotlight
The paper addresses an important issue in representation learning with diffusion models. The proposed solution appears theoretically sound and empirically convincing to me, but I have to admit that I am not an expert on the relevant literature regarding other solutions to the discussed issue (if any), nor on state-of-the-art empirical performance, so I will yield to the judgment of the other reviewers in this regard. The paper is remarkably well organized. Despite addressing a rather complicate
I am not an expert on diffusion models with auxiliary encoders, so I cannot judge the novelty of the proposed method or the choice of baselines in the empirical evaluation. I only identified minor weaknesses (although I strongly recommend addressing at least the last point below as it should be relatively easy to fix and would make the paper much more accessible). I did not understand the significance of the information split problem in unconditional generation (Section 4.4.2). I appreciate th
Quality and clarity: the paper is well-written and the idea is motivated and easy to follow. Originality: the paper aims to solve the so-called information-split problem in the field. The effectiveness of the proposed method is well-supported by theorems and comprehensive experiments. The theorems indicate the proposed loss can indeed increase the mutual information between the input images and the latent variable. Moreover, the generated data distribution is guaranteed to be close to the in
1. It is unclear how the loss makes the endpoint dependent on the latent variable. The theorems only show the relationship between the input data and the latent variable, which is not aligned with the claim. 2. The intuition is missing in Section 4.1. For example, why is the new forward SDE defined as in Eq. 10? Providing more intuition would help readers appreciate the forward process.
1. Combining VAE with diffusion is of practical interest. The motivation and solution seem reasonable to me. The authors also provide some theoretical justifications of the model and objective functions. 2. There are thorough experiments to demonstrate the usefulness and performance of the proposed model.
1. In the diffusion bridge Eq. (5), the input is transformed into a fixed target. However, in the proposed model, the target $x_T$ is random due to the randomness of the VAE. Does this affect the theory? 2. In Section 4.4, the authors present objective functions for reconstruction and generation tasks. In experiments, did the authors optimize different objectives and use different models for different tasks? Can we obtain both reconstruction and generation abilities using the same model? 3. I
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Diffusion
