VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling

Qian Zhang; Xiangzi Dai; Ninghua Yang; Xiang An; Ziyong Feng; Xingyu Ren

arXiv:2408.01181 · cs.CV · August 5, 2024

TL;DR

VAR-CLIP introduces a novel text-to-image generation approach that combines visual auto-regressive modeling with CLIP, enabling high-quality, faithful, and aesthetic image synthesis guided by textual captions.

Contribution

It integrates visual auto-regressive modeling with CLIP for text-guided image generation, and it constructs a large image-text dataset with BLIP2 so that the model can be trained on extensive datasets such as ImageNet.

Findings

01. High fidelity and aesthetic quality in generated images

02. Effective caption guidance through word positioning analysis

03. Successful training on large-scale datasets like ImageNet

Abstract

VAR is a new generation paradigm that employs 'next-scale prediction' as opposed to 'next-token prediction'. This innovative transformation enables auto-regressive (AR) transformers to rapidly learn visual distributions and achieve robust generalization. However, the original VAR model is constrained to class-conditioned synthesis and does not support generation guided solely by textual captions. In this paper, we introduce VAR-CLIP, a novel text-to-image model that integrates Visual Auto-Regressive techniques with the capabilities of CLIP. The VAR-CLIP framework encodes captions into text embeddings, which are then utilized as textual conditions for image generation. To facilitate training on extensive datasets, such as ImageNet, we have constructed a substantial image-text dataset leveraging BLIP2. Furthermore, we delve into the significance of word positioning within CLIP for the purpose of caption…
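
The contrast between 'next-scale prediction' and 'next-token prediction' mentioned in the abstract can be illustrated with a small counting sketch. It is not taken from the paper or its repository: the scale schedule `ASSUMED_SCALES` and the helper names are hypothetical, chosen only to show that a coarse-to-fine generator needs one auto-regressive step per scale, while a raster-scan generator needs one step per token.

```python
# Illustrative sketch only. The scale schedule is an assumption made for
# this example; this page does not state the schedule VAR-CLIP uses.
ASSUMED_SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def next_token_steps(side: int) -> int:
    """Raster-scan AR: one sequential step per token of a side x side map."""
    return side * side


def next_scale_steps(scales) -> int:
    """Next-scale AR: one sequential step per scale.

    All tokens of a given scale are predicted in parallel, conditioned on
    the coarser scales generated before it.
    """
    return len(scales)


def total_tokens(scales) -> int:
    """Total number of tokens produced across all scales."""
    return sum(s * s for s in scales)


def summarize(scales=ASSUMED_SCALES) -> dict:
    """Collect the step counts for the finest token map in the schedule."""
    final_side = scales[-1]
    return {
        "final_side": final_side,
        "next_token_steps": next_token_steps(final_side),
        "next_scale_steps": next_scale_steps(scales),
        "total_tokens": total_tokens(scales),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

Under this assumed schedule, the next-scale generator produces more tokens in total than a single raster scan of the finest map, but it does so in far fewer sequential steps. In a text-conditioned variant such as the one the abstract describes, each of those steps would additionally be conditioned on the caption embedding.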

Code & Models

Repositories

daixiangzi/var-clip (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training