Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Shilong Zhang; He Zhang; Zhifei Zhang; Chongjian Ge; Shuchen Xue; Shaoteng Liu; Mengwei Ren; Soo Ye Kim; Yuqian Zhou; Qing Liu; Daniil Pakhomov; Kai Zhang; Zhe Lin; Ping Luo

arXiv:2512.17909 · cs.CV · December 22, 2025

TL;DR

This paper presents a framework that adapts understanding-oriented encoder features for text-to-image generation and editing. It regularizes the latent space with a semantic-pixel reconstruction objective and reports state-of-the-art image reconstruction, faster convergence, and improved generation and editing performance.

Contribution

It introduces a semantic-pixel reconstruction objective to regularize encoder features, enabling compact, semantically rich latent representations for improved generative tasks.
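
As a rough illustration only, a semantic-pixel reconstruction objective can be pictured as a weighted sum of two reconstruction errors: one on encoder features (semantics) and one on image pixels (detail). The sketch below is an assumption, not the paper's formulation: the use of mean squared error, the function names `mse` and `semantic_pixel_loss`, and the weights `semantic_weight` and `pixel_weight` are all hypothetical, and flat lists of floats stand in for feature maps and images.

```python
def mse(target, recon):
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(target) != len(recon) or len(target) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - r) ** 2 for t, r in zip(target, recon)) / len(target)


def semantic_pixel_loss(feat_target, feat_recon, pixels_target, pixels_recon,
                        semantic_weight=1.0, pixel_weight=1.0):
    """Hypothetical combined objective (illustrative, not the paper's loss).

    feat_target / feat_recon: encoder features and their reconstruction
    from the compact latent (semantic term).
    pixels_target / pixels_recon: input image and decoded image (pixel term).
    """
    semantic_term = mse(feat_target, feat_recon)
    pixel_term = mse(pixels_target, pixels_recon)
    return semantic_weight * semantic_term + pixel_weight * pixel_term


# Toy usage: features are off by 1 everywhere, the single pixel is off by 2.
loss = semantic_pixel_loss([0.0, 0.0], [1.0, 1.0], [0.0], [2.0])
```

Under these assumptions, raising `pixel_weight` favors fine-grained detail while raising `semantic_weight` keeps the latent close to the encoder's feature space; how the paper actually balances the two terms is not stated on this page.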

Findings

01. State-of-the-art image reconstruction accuracy

02. Faster convergence in generative tasks

03. Enhanced performance in text-to-image and editing applications

Abstract

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · 3D Shape Modeling and Analysis