TL;DR
CoVAE introduces a single-stage training framework for VAEs using consistency models, enabling high-quality, fast sampling without a learned prior, and unifying autoencoding with diffusion-style generative modeling.
Contribution
It proposes CoVAE, a novel single-stage VAE training method that leverages consistency training and progressive latent representations, reducing complexity and sampling time.
Findings
CoVAE generates high-quality samples in one or few steps.
Outperforms traditional VAEs and other single-stage methods.
Provides a unified autoencoding and diffusion modeling framework.
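The one-or-few-step sampling in the findings above can be pictured as a short loop: decode a prior draw once, then optionally re-encode and decode again at lower noise levels. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function `few_step_sample`, the signatures `encoder(x, t, rng)` and `decoder(z, t)`, the standard normal prior, and the toy stand-in networks are all hypothetical.

```python
import random


def few_step_sample(encoder, decoder, latent_dim, times, rng):
    """Draw one sample; `times` is a decreasing list of noise levels, e.g. [1.0, 0.5].

    Assumed (hypothetical) interfaces: encoder(x, t, rng) -> latent list,
    decoder(z, t) -> data list.
    """
    # Assumption: the prior is a fixed standard normal, so no prior is learned.
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    x = decoder(z, times[0])        # one-step generation
    for t in times[1:]:             # optional few-step refinement
        z_t = encoder(x, t, rng)    # re-encode at an intermediate time t
        x = decoder(z_t, t)         # decode (denoise) again
    return x


# Toy stand-ins so the sketch runs; real models would be neural networks.
def toy_encoder(x, t, rng):
    return [v + t * rng.gauss(0.0, 1.0) for v in x]


def toy_decoder(z, t):
    return [0.5 * v for v in z]


sample = few_step_sample(toy_encoder, toy_decoder, latent_dim=4,
                         times=[1.0, 0.5], rng=random.Random(0))
```

Passing a single-element `times` list gives plain one-step generation; longer lists trade extra network evaluations for refinement.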
Abstract
Current state-of-the-art generative approaches frequently rely on a two-stage training procedure, where an autoencoder (often a VAE) first performs dimensionality reduction, followed by training a generative model on the learned latent space. While effective, this introduces computational overhead and increased sampling times. We challenge this paradigm by proposing Consistency Training of Variational AutoEncoders (CoVAE), a novel single-stage generative autoencoding framework that adopts techniques from consistency models to train a VAE architecture. The CoVAE encoder learns a progressive series of latent representations with increasing encoding noise levels, mirroring the forward processes of diffusion and flow matching models. This sequence of representations is regulated by a time-dependent parameter that scales the KL loss. The decoder is trained using a consistency loss…
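To make the time-dependent KL scaling concrete, here is a minimal sketch of a time-weighted KL term for a diagonal Gaussian encoder. The KL formula and the reparametrization are the standard VAE ones. The schedule `beta_schedule` and its endpoints `beta_min` and `beta_max` are hypothetical choices for illustration only; the abstract does not specify the schedule, and the consistency loss for the decoder is not sketched here.

```python
import math
import random


def gaussian_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


def reparametrize(mu, sigma, rng):
    """Standard VAE reparametrization: z = mu + sigma * eps, eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def beta_schedule(t, beta_min=1e-3, beta_max=1.0):
    """Hypothetical KL weight, increasing in t on [0, 1] (log-linear)."""
    return beta_min * (beta_max / beta_min) ** t


def weighted_kl(mu, sigma, t):
    """KL term scaled by the time-dependent weight."""
    return beta_schedule(t) * gaussian_kl(mu, sigma)
```

Under this assumed schedule, a larger `t` penalizes deviation from the prior more strongly, which pushes the encoder toward noisier latent representations at later times.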
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper formulates the VAE reparametrization as a time-indexed “forward process” in latent space and replaces standard reconstruction with a discrete consistency objective that bootstraps from early times. The method section and algorithmic details substantiate this bridge.
2. CoVAE generates in one step and can optionally do few-step refinement by re-encoding/re-denoising at intermediate t. This is a practical departure from the common “VAE + latent diffusion/flow” recipe.
3. The paper s
1. A major concern is the applicability of the proposed approach, both to future research and real-world application. While CoVAE aims to unify VAE and the diffusion process for generation tasks in one single stage, it neglects text (or class) conditioning in modeling and implementation for image generation, which is crucial in current generative models. The paper compares CoVAE with standard VAE and demonstrates its advantages. However, standard VAE can be readily used to model visual signal
- The proposed idea in CoVAE to use consistency training in VAEs is novel and interesting.
- The performance among VAEs is better, and CoVAE also offers the option to trade off efficiency and performance with multi-step generation.
- There is limited insight into the fundamental difference consistency training brings in VAEs that leads to performance improvements. While iterative denoising is intuitively justified in diffusion-based or consistency models, where the coupling between latent variables and data points is unknown, it is less clear in the case of VAEs, where the latent variable corresponding to a given data point can be obtained through the encoder.
- In Section 2.1, the authors mention the prior hole problem, wh
Strengths of the paper include:
- Concise, clear mathematical introduction of VAEs, Diffusion models, Consistency models, and the proposed CoVAE approach.
- Detailed experiments including multiple datasets and multiple baseline models
- Detailed and fair discussion of related work
- Clear statements of limitations of the current work that identify important problems to address in future research
The main weaknesses of the paper are twofold:
- First, the datasets are limited to MNIST, CIFAR10, and CelebA
- Second, the models compared with are valuable but not state-of-the-art in terms of performance
Minor issues:
- Small typo in Figure 1 caption: Consistenct
- Line 229 "Form small time steps"
- Figure 2 is confusing. I suggest you explain what the objects in the figure are one by one, starting from the left. E.g. is "In Diffusion and Consistency" about the first picture or the first
