DiffEnc: Variational Diffusion with a Learned Encoder
Beatrix M. G. Nielsen, Anders Christensen, Andrea Dittadi, Ole Winther

TL;DR
This paper introduces DiffEnc, a flexible diffusion model with a learned encoder that improves likelihood performance on CIFAR-10 by incorporating data-dependent means and adjustable noise ratios, offering new theoretical insights.
Contribution
The paper proposes a novel diffusion framework with a learned encoder, data-dependent means, and adjustable noise ratios, enhancing model flexibility and theoretical understanding.
Findings
Significant likelihood improvement on CIFAR-10
Theoretical insights into ELBO and noise scheduling
Flexible diffusion loss with learned encoder
Abstract
Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditional distributions in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two changes to the diffusion model that retain these advantages while adding flexibility to the model. Firstly, we introduce a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves a statistically significant improvement in likelihood on CIFAR-10. Secondly, we let the ratio of the noise variance of the reverse encoder process and the generative process be a free weight parameter rather than being fixed to 1. This leads to theoretical insights: For a finite depth hierarchy, the evidence lower bound (ELBO) can be used as an…
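The first change described in the abstract, a data- and depth-dependent mean in the diffusion process, can be illustrated with a small NumPy sketch. The abstract does not give the exact parameterization, so everything here is an assumption made for illustration: the mean is taken to be `alpha_t * encoder(x, t)`, the schedule is a variance-preserving cosine schedule, and `toy_encoder` is a hypothetical hand-written stand-in for a learned network. It is not the authors' implementation.

```python
import numpy as np

def vp_schedule(t):
    """Assumed variance-preserving schedule with alpha_t^2 + sigma_t^2 = 1."""
    alpha = np.cos(0.5 * np.pi * t)
    sigma = np.sin(0.5 * np.pi * t)
    return alpha, sigma

def identity_encoder(x, t):
    """Standard diffusion: the mean depends on x only through alpha_t * x."""
    return x

def toy_encoder(x, t, shift=0.1):
    """Hypothetical stand-in for a learned, depth-dependent encoder.
    It returns x unchanged at t = 0 and contracts x slightly as t grows."""
    return x * (1.0 - shift * t)

def sample_latent(x, t, encoder, rng):
    """Draw z_t ~ N(alpha_t * encoder(x, t), sigma_t^2 I) (assumed form).
    Returns the sample and its mean."""
    alpha, sigma = vp_schedule(t)
    mean = alpha * encoder(x, t)
    eps = rng.standard_normal(x.shape)
    return mean + sigma * eps, mean

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, 3, 32, 32))  # CIFAR-10-shaped toy batch
z_t, mean_t = sample_latent(x, 0.5, toy_encoder, rng)
```

With `identity_encoder` the sketch reduces to the usual forward process, which shows that the encoder changes only where the noisy latents are centered at each depth `t`.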
Peer Reviews
Decision · ICLR 2024 poster
The paper is well written and easy to follow. It adds a simple yet interesting addition to diffusion by introducing a mean shift to the forward diffusion while not being required in the sampling process and therefore ensuring its scalability. The theoretical analysis of the different noise variances adds an interesting flavour too. The results seem to indicate that the added encoder improves the performance in terms of bits per dimension.
The paper has very limited evaluation and doesn't compare to some of the relevant baselines that are even mentioned in the paper, like latent diffusion, and only compares on CIFAR-10 and MNIST. Furthermore, it mentions that some methods only show improvement after longer training, hinting at potential inconsistencies in the results in case of slightly different training setups due to not training till convergence. It is hard to judge whether the proposed changes are a significant improvement due to t
I believe that implementing a trainable encoder within the context of the diffusion model represents a promising avenue for enhancing diffusion models, particularly in terms of ELBO optimization. This paper offers comprehensive insights into the derivation and analysis, rendering it accessible and straightforward to grasp. The experimental findings are not only persuasive but also harmonize effectively with the theoretical framework. For instance, in Figure 1, we observe a logical outcome indica
1. The organization of sections is perplexing. It's challenging for me to discern whether Section 2 serves as an introductory section or is meant to highlight one of your contributions. 2. The absence of a central theorem throughout the paper poses a difficulty for readers in anticipating the direction of the derivations and what to expect.
The high-level idea for the paper is quite natural and something that somebody was bound to try because of its potential impact. Overall, I found the writing fairly clear, with some exceptions that I will mention in the next section. The analysis of the method is fairly extensive and supported by a lot of details in the appendix, though these are primarily mathematical proofs and not necessarily an exploration of design decisions that have a high level of practical significance.
The presentation of the method, namely Sections 3 and 6, could be improved significantly. There are a lot of variable names, and I had to read through the section many times in order to understand what was happening, even though the final procedure is not that complex. Moving the figure provided in Appendix A to the main text might be helpful in this regard. Or you could include an algorithm, or simply a link to your code, as these would all be easier to parse as someone familiar with common diffus
Taxonomy
Topics: Model Reduction and Neural Networks · Generative Adversarial Networks and Image Synthesis · Topic Modeling
Methods: Diffusion
