Elucidating the design space of language models for image generation

Xuantong Liu; Shaozhe Hao; Xianbiao Qi; Tianyang Hu; Jun Wang; Rong Xiao; Yuan Yao

arXiv:2410.16257 · cs.CV · October 22, 2024

TL;DR

This paper explores the design space of language models for image generation, analyzing their behavior, challenges, and scalability, and introduces ELM, which achieves state-of-the-art results on ImageNet 256x256.

Contribution

The paper provides the first comprehensive analysis of language model optimization in vision tasks and offers insights into model design choices for image generation.

Findings

01. Larger models better capture global image context.
02. Image tokens exhibit more randomness than text tokens.
03. ELM achieves state-of-the-art performance on ImageNet 256x256.

Abstract

The success of autoregressive (AR) language models in text generation has inspired the computer vision community to adopt Large Language Models (LLMs) for image generation. However, considering the essential differences between text and image modalities, the design space of language models for image generation remains underexplored. We observe that image tokens exhibit greater randomness compared to text tokens, which presents challenges when training with token prediction. Nevertheless, AR models demonstrate their potential by effectively learning patterns even from a seemingly suboptimal optimization problem. Our analysis also reveals that while all models successfully grasp the importance of local information in image generation, smaller models struggle to capture the global context. In contrast, larger models showcase improved capabilities in this area, helping to explain the…
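One way to make the "randomness" of a token stream concrete is the Shannon entropy of its empirical token frequencies, normalized by the log of the vocabulary size. The sketch below is illustrative only: the measure, the function name `normalized_entropy`, and the toy sequences are assumptions made here, not the metric or data reported in the paper.

```python
import math
from collections import Counter


def normalized_entropy(tokens, vocab_size):
    """Entropy of the empirical token distribution, divided by log(vocab_size).

    Returns a value in [0, 1]: values near 1 mean the tokens are used almost
    uniformly (little structure for a next-token predictor to exploit at the
    unigram level); values near 0 mean a few tokens dominate.
    Assumes vocab_size > 1 and a non-empty token list.
    """
    counts = Counter(tokens)
    total = len(tokens)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(vocab_size)


# Toy sequences, made up for illustration (not data from the paper).
skewed = [0] * 13 + [1, 2, 3]   # one token dominates, as in a skewed text vocabulary
uniform = [0, 1, 2, 3] * 4      # every token equally frequent

print(normalized_entropy(skewed, vocab_size=4))
print(normalized_entropy(uniform, vocab_size=4))
```

Under this assumed measure, a stream that uses its vocabulary more evenly scores closer to 1, which is the sense in which a token stream can be called more random. It only looks at unigram frequencies, so it says nothing about how predictable a token is from its context.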

Code & Models

Repositories

Pepper-lll/LMforImageGeneration (PyTorch, official)

Models

xuantonglll/ELM (Hugging Face model)

Videos

Elucidating the design space of language models for image generation (SlidesLive)

Taxonomy

Topics: Image Retrieval and Classification Techniques