Unified Latents (UL): How to train your latents

Jonathan Heek; Emiel Hoogeboom; Thomas Mensink; Tim Salimans

arXiv:2602.17270·cs.LG·February 20, 2026

Open Access

TL;DR

Unified Latents (UL) is a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. It reaches an FID of 1.4 on ImageNet-512 and a state-of-the-art FVD of 1.3 on Kinetics-600, while requiring fewer training FLOPs than models trained on Stable Diffusion latents.

Contribution

UL links the encoder's output noise to the diffusion prior's minimum noise level. This yields a simple training objective that provides a tight upper bound on the latent bitrate, so the latents for images and videos are regularized by the same diffusion prior that later generates them.

Findings

01. Achieves an FID of 1.4 on ImageNet-512, with high reconstruction quality (PSNR).

02. Sets a new state-of-the-art FVD of 1.3 on Kinetics-600.

03. Requires fewer training FLOPs than models trained on Stable Diffusion latents.

Abstract

We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4 with high reconstruction quality (PSNR), while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
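
The idea of tying the encoder's output noise to the prior's minimum noise level can be illustrated with a small sketch. This is not the authors' implementation: it assumes a variance-preserving noise schedule parameterized by log-SNR, a fixed-variance Gaussian encoder, and a hypothetical constant `LOGSNR_MAX` standing in for the prior's minimum noise level. The function names and all values are illustrative only.

```python
import math
import random


def vp_coefficients(logsnr):
    """Return (alpha, sigma) for a variance-preserving schedule.

    Assumption: alpha^2 = sigmoid(logsnr) and sigma^2 = sigmoid(-logsnr),
    so that alpha^2 + sigma^2 = 1.
    """
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-logsnr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(logsnr)))
    return alpha, sigma


def noisy_latent(z_clean, logsnr, rng):
    """Add Gaussian noise to a deterministic encoder output.

    The noise level is fixed by `logsnr` rather than predicted by the
    encoder, so the sample has the same form as the least-noisy state
    of a diffusion process over latents.
    """
    alpha, sigma = vp_coefficients(logsnr)
    return [alpha * z + sigma * rng.gauss(0.0, 1.0) for z in z_clean]


# Hypothetical value: the highest log-SNR (lowest noise) the prior models.
LOGSNR_MAX = 5.0

rng = random.Random(0)
z_clean = [0.5, -1.0, 2.0]  # stand-in for an encoder output
alpha0, sigma0 = vp_coefficients(LOGSNR_MAX)
z0 = noisy_latent(z_clean, LOGSNR_MAX, rng)
```

Under these assumptions, the diffusion prior never has to model latents cleaner than `z0`, which is one way to read the abstract's claim that matching the two noise levels leads to a simple objective.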

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis