SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu

TL;DR
SVG-T2I performs high-quality text-to-image synthesis directly in the Visual Foundation Model (VFM) feature space, reaching competitive results without a Variational Autoencoder (VAE).
Contribution
The paper extends the SVG framework to support direct text-to-image generation in the VFM domain, enabling scalable and effective diffusion-based synthesis without VAEs.
Findings
Achieves 0.75 on GenEval and 85.78 on DPG-Bench.
Validates the representational power of VFMs for generative tasks.
Open-sources the project, including the autoencoder and the generation model, for further research.
Abstract
Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training,…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship
