A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Liang Chen; Sinan Tan; Zefan Cai; Weichu Xie; Haozhe Zhao; Yichi Zhang; Junyang Lin; Jinze Bai; Tianyu Liu; Baobao Chang

arXiv:2410.01912·cs.CV·October 4, 2024

TL;DR

This paper introduces the DnD-Transformer, a 2D autoregressive model that enhances image generation quality and demonstrates emergent vision-language understanding by generating images with text and graphics.

Contribution

The novel 2D autoregression architecture improves image quality and enables self-supervised vision-language capabilities in autoregressive models.

Findings

1. Higher-quality images with the same model size and sequence length
2. Ability to generate images with rich text and graphical elements
3. Vision-language understanding demonstrated without multimodal training

Abstract

This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer's potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a…
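To make the idea of a depth-wise code decomposition concrete, the following is a minimal sketch of residual quantization, the kind of 2D decomposition associated with RQ-style models. It is not the authors' implementation and says nothing about how the DnD-Transformer predicts the codes; it only shows how each spatial position can carry several codes, one per depth level. The function name `residual_quantize`, the toy sizes, the random codebook, and the all-zero codebook entry (added so the error cannot grow with depth) are all assumptions made for illustration.

```python
import numpy as np

def residual_quantize(features, codebook, depth):
    """Assign `depth` codes to every feature vector.

    Each level picks the nearest codebook entry to the current residual,
    then subtracts it, so later levels refine what earlier levels missed.
    (Hypothetical helper for illustration only.)
    """
    residual = features.copy()
    codes = np.empty((features.shape[0], depth), dtype=np.int64)
    recon = np.zeros_like(features)
    for d in range(depth):
        # Squared distance from every residual to every codebook entry.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes[:, d] = idx
        recon += codebook[idx]      # running reconstruction
        residual -= codebook[idx]   # what is still unexplained
    return codes, recon

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))          # toy: 16 positions, 8 channels
codebook = np.vstack([np.zeros((1, 8)),      # assumed zero entry
                      rng.normal(size=(63, 8))])

for depth in (1, 2, 4):
    codes, recon = residual_quantize(features, codebook, depth)
    err = ((features - recon) ** 2).mean()
    print(depth, codes.shape, round(float(err), 4))
```

In this toy setup the code map has shape (sequence length, depth): a 1D autoregressive model would only consume the first column, while a model with a second autoregression direction has additional codes per position to predict without lengthening the sequence.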

Code & Models

Repositories

chenllliang/dnd-transformer
PyTorch · Official

Models

🤗 leonardPKU/DnD-Transformer
model · 2 dl · ♡ 3

Taxonomy

Topics: Advanced Vision and Imaging · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings