Visual Generation Tuning
Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang

TL;DR
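This paper introduces VGT (Visual Generation Tuning), a method that enables vision language models to perform visual generation efficiently. It reports state-of-the-art results on GenEval and DPG-Bench and a 20x speedup in the convergence of autoregressive modeling in the continuous space.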
Contribution
VGT is a novel paradigm that aligns pretrained VLMs with pixel decoders, significantly reducing alignment costs and accelerating autoregressive visual generation.
Findings
Achieves 26.67 PSNR and 0.50 rFID in image reconstruction.
Attains state-of-the-art results on GenEval and DPG-Bench.
Provides a versatile approach for endowing VLMs with visual generation capabilities.
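For readers unfamiliar with the reconstruction metric quoted above, PSNR is defined as 10 · log10(MAX² / MSE), where MSE is the mean squared error between the original and reconstructed pixels and MAX is the largest possible pixel value. The following is a minimal sketch of that standard formula, not the paper's evaluation code; the function name `psnr`, the flat-list input format, and the 8-bit range (`max_value=255.0`) are assumptions made for illustration. The resolution, dataset, and value range behind the reported 26.67 figure are not stated here.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences.

    Hypothetical helper: inputs are flat sequences of pixel intensities,
    and max_value is the assumed peak intensity (255 for 8-bit images).
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    if not reference:
        raise ValueError("inputs must not be empty")

    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [100, 100, 100, 100]
    reconstructed = [110, 90, 110, 90]  # every pixel is off by 10, so MSE = 100
    print(f"PSNR: {psnr(original, reconstructed):.2f} dB")
```

Higher PSNR means the reconstruction is closer to the original at the pixel level. rFID, by contrast, compares feature statistics of reconstructed and real images, and lower is better.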
Abstract
Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Language and cultural evolution
