End-to-End Training for Unified Tokenization and Latent Denoising

Shivam Duggal; Xingjian Bai; Zongze Wu; Richard Zhang; Eli Shechtman; Antonio Torralba; Phillip Isola; William T. Freeman

arXiv:2603.22283·cs.CV·March 24, 2026

TL;DR

This paper introduces UNITE, a unified autoencoder architecture that jointly trains tokenization and latent diffusion in a single stage, simplifying the process and achieving high-quality results across image and molecule domains.

Contribution

UNITE enables joint training of tokenization and diffusion in one stage, eliminating the need for complex multi-stage training and pretrained encoders, and achieves near state-of-the-art performance.

Findings

01. Achieves FID 2.12 (Base model) and 1.73 (Large model) on ImageNet 256x256.

02. Demonstrates the feasibility of single-stage joint training for tokenization and generation.

03. Shows that shared parameters foster a common latent language across modalities.

Abstract

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, and only then can the diffusion model be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the…
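The weight-sharing idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated before it specifies the training procedure, so everything below is an assumption made for illustration. `SharedEncoder`, `joint_loss`, the linear maps, the interpolation-style noising, the denoising target, and the equal loss weighting are all hypothetical. The sketch only shows one set of parameters being called twice, once on an observed input (tokenization) and once on a noised latent with class conditioning (generation), with both losses summed into a single objective.

```python
import random

random.seed(0)
DIM = 4  # toy size shared by image, latent, and conditioning vectors (assumption)


def make_matrix(rows, cols):
    """Random small weight matrix as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product for list-based matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class SharedEncoder:
    """Hypothetical stand-in: one set of weights used for both roles."""

    def __init__(self):
        self.w_in = make_matrix(DIM, DIM)
        self.w_cond = make_matrix(DIM, DIM)

    def __call__(self, x, cond):
        a = matvec(self.w_in, x)
        b = matvec(self.w_cond, cond)
        return [p + q for p, q in zip(a, b)]


def joint_loss(encoder, decoder_w, image, class_embedding, noise_level=0.5):
    """Two passes through the same encoder; both losses are summed (assumed weighting)."""
    no_cond = [0.0] * DIM
    # Pass 1 (tokenization role): infer a latent from the fully observed input.
    latent = encoder(image, no_cond)
    recon_loss = mse(matvec(decoder_w, latent), image)
    # Pass 2 (generation role): same weights, noised latent plus class conditioning.
    noise = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    noisy = [(1 - noise_level) * z + noise_level * n for z, n in zip(latent, noise)]
    denoised = encoder(noisy, class_embedding)
    denoise_loss = mse(denoised, latent)  # assumed target: the clean latent
    return recon_loss + denoise_loss, recon_loss, denoise_loss


encoder = SharedEncoder()
decoder_w = make_matrix(DIM, DIM)
total, recon, denoise = joint_loss(
    encoder, decoder_w, [1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.0, 0.0]
)
print(total, recon, denoise)
```

Because both passes go through the same `encoder` object, a gradient step on the summed loss would update one set of weights for both roles, which is the single-stage property the abstract describes. A real system would use a deep network and an autodiff framework rather than these list-based linear maps.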

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Machine Learning in Materials Science