MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li, Ziwei Huang, LeiLei Gan, Hao Jiang

TL;DR
MARS is a novel auto-regressive framework for text-to-image synthesis that integrates pre-trained language models with visual understanding, achieving high-quality, bilingual, and efficient image generation.
Contribution
Introduces SemVIE, a new component that combines language and visual processing in auto-regressive models, enhancing T2I synthesis with multi-stage training and bilingual capabilities.
Findings
Achieves high-quality T2I generation with only 9% of the GPU days of SD1.5.
Supports bilingual (English and Chinese) image and text generation.
Significantly improves text-image alignment and image detail granularity.
Abstract
Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this…
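The abstract describes SemVIE as processing linguistic and visual information independently, with the textual component frozen and the visual component fine-tuned. The following is a minimal toy sketch of that idea only, not the authors' implementation: each "expert" is reduced to a single scalar weight, tokens are routed by a modality flag, and a gradient step updates only the trainable visual expert. All names (`ScalarExpert`, `routed_forward`, `train_step`) and the squared-error loss are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ScalarExpert:
    """Hypothetical stand-in for one expert: output = weight * input."""
    weight: float
    trainable: bool

    def forward(self, x: float) -> float:
        return self.weight * x

    def apply_gradient(self, grad: float, lr: float) -> None:
        # A frozen expert ignores updates, so its behaviour is preserved.
        if self.trainable:
            self.weight -= lr * grad


def routed_forward(tokens: Sequence[float], is_visual: Sequence[bool],
                   text_expert: ScalarExpert,
                   visual_expert: ScalarExpert) -> List[float]:
    """Send each token to the expert that matches its modality."""
    return [(visual_expert if v else text_expert).forward(x)
            for x, v in zip(tokens, is_visual)]


def train_step(tokens: Sequence[float], is_visual: Sequence[bool],
               targets: Sequence[float], text_expert: ScalarExpert,
               visual_expert: ScalarExpert, lr: float = 0.01) -> float:
    """One gradient step on a squared-error loss; returns the loss before the update."""
    outputs = routed_forward(tokens, is_visual, text_expert, visual_expert)
    loss, grad_text, grad_visual = 0.0, 0.0, 0.0
    for x, v, y, t in zip(tokens, is_visual, outputs, targets):
        err = y - t
        loss += err * err
        grad = 2.0 * err * x  # d(err^2)/d(weight) for output = weight * x
        if v:
            grad_visual += grad
        else:
            grad_text += grad
    text_expert.apply_gradient(grad_text, lr)      # no-op: expert is frozen
    visual_expert.apply_gradient(grad_visual, lr)  # only this weight moves
    return loss


# Toy data: one text token followed by two visual tokens.
text_expert = ScalarExpert(weight=1.0, trainable=False)
visual_expert = ScalarExpert(weight=0.0, trainable=True)
tokens = [1.0, 2.0, 3.0]
is_visual = [False, True, True]
targets = [1.0, 4.0, 6.0]

loss_first = train_step(tokens, is_visual, targets, text_expert, visual_expert)
loss_second = train_step(tokens, is_visual, targets, text_expert, visual_expert)
```

In this toy setup the text expert's weight stays at its initial value while the visual expert's weight moves toward the target, which mirrors the stated goal of keeping the language model's capabilities intact while training the visual pathway. A real system would use full transformer sub-modules rather than scalars.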
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Processing and 3D Reconstruction · Advanced Neural Network Applications
Methods: Diffusion · Balanced Selection
