SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Minglei Shi; Haolin Wang; Borui Zhang; Wenzhao Zheng; Bohan Zeng; Ziyang Yuan; Xiaoshi Wu; Yuanxing Zhang; Huan Yang; Xintao Wang; Pengfei Wan; Kun Gai; Jie Zhou; Jiwen Lu

arXiv:2512.11749·cs.CV·December 15, 2025

TL;DR

SVG-T2I performs text-to-image synthesis directly in the Visual Foundation Model (VFM) feature space and reaches competitive results (0.75 on GenEval, 85.78 on DPG-Bench) without a Variational Autoencoder.

Contribution

The paper extends the SVG framework to support direct text-to-image generation in the VFM domain, enabling scalable and effective diffusion-based synthesis without VAEs.

Findings

1. Achieves 0.75 on GenEval and 85.78 on DPG-Bench.

2. Validates the representational power of VFMs for generative tasks.

3. Provides open-source tools and models for further research.

Abstract

Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training,…

Code & Models

Models

KlingTeam/SVG-T2I (Hugging Face model · 14 downloads · 32 likes)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship