MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

Wanggui He; Siming Fu; Mushui Liu; Xierui Wang; Wenyi Xiao; Fangxun Shu; Yi Wang; Lei Zhang; Zhelun Yu; Haoyuan Li; Ziwei Huang; LeiLei Gan; Hao Jiang

arXiv:2407.07614·cs.CV·July 12, 2024

TL;DR

MARS is a novel auto-regressive framework for text-to-image synthesis that integrates pre-trained language models with visual understanding, achieving high-quality, bilingual, and efficient image generation.

Contribution

Introduces the Semantic Vision-Language Integration Expert (SemVIE), a component that combines language and visual processing in auto-regressive models, enhancing T2I synthesis with multi-stage training and bilingual capabilities.

Findings

1. Achieves high-quality T2I generation with only 9% of the GPU days of SD1.5.

2. Supports bilingual (English and Chinese) image and text generation.

3. Significantly improves text-image alignment and image detail granularity.

Abstract

Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this…
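
The abstract describes SemVIE as processing linguistic and visual information independently, with the textual component frozen and the visual component fine-tuned. The toy sketch below illustrates only that idea and is not the authors' implementation: the names `Expert`, `route`, and `sgd_step` are hypothetical, and each "expert" is a single scalar weight standing in for a full Transformer path.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    """A toy one-parameter 'expert' computing y = weight * x (hypothetical stand-in)."""
    weight: float
    trainable: bool

    def forward(self, x: float) -> float:
        return self.weight * x


def route(tokens, text_expert, visual_expert):
    """Send each (modality, value) token to the expert for its modality."""
    outputs = []
    for modality, value in tokens:
        expert = text_expert if modality == "text" else visual_expert
        outputs.append(expert.forward(value))
    return outputs


def sgd_step(tokens, targets, text_expert, visual_expert, lr=0.05):
    """One pass of squared-error gradient updates; frozen experts are skipped."""
    for (modality, value), target in zip(tokens, targets):
        expert = text_expert if modality == "text" else visual_expert
        if not expert.trainable:
            continue  # frozen: the pre-trained weight is left untouched
        error = expert.forward(value) - target
        # d/dw (w*x - t)^2 = 2 * (w*x - t) * x
        expert.weight -= lr * 2.0 * error * value


# Assumption: the text path is frozen and the visual path is trainable,
# mirroring the freeze/fine-tune split described in the abstract.
text_expert = Expert(weight=1.0, trainable=False)
visual_expert = Expert(weight=0.0, trainable=True)

tokens = [("text", 2.0), ("image", 1.0), ("image", 3.0)]
targets = [5.0, 2.0, 6.0]

before = route(tokens, text_expert, visual_expert)
sgd_step(tokens, targets, text_expert, visual_expert)
after = route(tokens, text_expert, visual_expert)
```

After the update, the text token's output is unchanged while the image tokens' outputs move toward their targets, which is the property the frozen/fine-tuned split is meant to preserve for the language path.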

Code & Models

Repositories

fusiming3/mars (Official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Processing and 3D Reconstruction · Advanced Neural Network Applications

Methods: Diffusion · Balanced Selection