Elucidating the design space of language models for image generation
Xuantong Liu, Shaozhe Hao, Xianbiao Qi, Tianyang Hu, Jun Wang, Rong Xiao, Yuan Yao

TL;DR
This paper explores the design space of language models for image generation, analyzing their behavior, challenges, and scalability, and introduces ELM, which achieves state-of-the-art results on ImageNet 256x256.
Contribution
The paper provides the first comprehensive analysis of language model optimization in vision tasks and offers insights into model design choices for image generation.
Findings
Larger models better capture global image context.
Image tokens exhibit more randomness than text tokens.
ELM achieves state-of-the-art performance on ImageNet 256x256.
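One rough way to make the "randomness" finding concrete is to compare the empirical entropy of token sequences, normalized by the maximum possible entropy for the vocabulary. The sketch below is only an illustration under stated assumptions: the function name `normalized_entropy`, the unigram-entropy measure, and the toy sequences are all hypothetical and are not taken from the paper, whose own measure is not described in this summary.

```python
import math
from collections import Counter


def normalized_entropy(tokens, vocab_size):
    """Empirical unigram entropy of `tokens`, divided by log(vocab_size).

    Returns a value in [0, 1]; values near 1 mean the tokens are spread
    almost uniformly over the vocabulary (i.e. look more random).
    """
    counts = Counter(tokens)
    total = len(tokens)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(vocab_size)


# Toy, made-up sequences over a 4-token vocabulary (assumption, not real data).
text_like = [0, 0, 0, 0, 1, 1, 2, 3]   # skewed usage: a few tokens dominate
image_like = [0, 1, 2, 3, 0, 1, 2, 3]  # near-uniform usage

print(normalized_entropy(text_like, 4))
print(normalized_entropy(image_like, 4))
```

In this toy setup the near-uniform sequence scores higher than the skewed one. A real comparison would need tokenized corpora from an actual text tokenizer and image tokenizer, and likely a conditional (next-token) measure rather than unigram counts.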
Abstract
The success of autoregressive (AR) language models in text generation has inspired the computer vision community to adopt Large Language Models (LLMs) for image generation. However, considering the essential differences between text and image modalities, the design space of language models for image generation remains underexplored. We observe that image tokens exhibit greater randomness compared to text tokens, which presents challenges when training with token prediction. Nevertheless, AR models demonstrate their potential by effectively learning patterns even from a seemingly suboptimal optimization problem. Our analysis also reveals that while all models successfully grasp the importance of local information in image generation, smaller models struggle to capture the global context. In contrast, larger models showcase improved capabilities in this area, helping to explain the…
Taxonomy
Topics: Image Retrieval and Classification Techniques
