ImageFolder: Autoregressive Image Generation with Folded Tokens

Xiang Li; Kai Qiu; Hao Chen; Jason Kuen; Jiuxiang Gu; Bhiksha Raj; Zhe Lin

arXiv:2410.01756·cs.CV·December 5, 2024

TL;DR

This paper introduces ImageFolder, a semantic tokenizer that uses folded tokens and dual-branch product quantization to improve autoregressive image generation quality and efficiency without increasing token length.

Contribution

The paper presents a novel semantic tokenizer with folded tokens and dual-branch quantization, balancing reconstruction and generation quality in autoregressive models.

Findings

01

ImageFolder achieves superior image generation quality.

02

It enables shorter token lengths without sacrificing performance.

03

The method improves efficiency in autoregressive image modeling.

Abstract

Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling. Increasing token length is a common approach to improve the image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality. There exists a trade-off between reconstruction and generation quality regarding token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to the trade-off. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality. To enhance the representative capability without increasing token length, we leverage…
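The abstract describes spatially aligned tokens that can be "folded" during autoregressive modeling, and the TL;DR mentions dual-branch product quantization. The following is a toy sketch of that idea, not the authors' implementation. It assumes classic product quantization (each latent vector is split in half and each half is looked up in its own codebook) and assumes that folding means pairing the two tokens at the same spatial position into one autoregressive step. The function names, codebook size, and dimensions are all hypothetical.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def product_quantize(latent, codebook_a, codebook_b):
    """Split one latent vector in half and quantize each half independently.

    Assumption: a plain two-way product quantizer stands in for the two branches.
    """
    half = len(latent) // 2
    return nearest(codebook_a, latent[:half]), nearest(codebook_b, latent[half:])

def fold(tokens_a, tokens_b):
    """Pair spatially aligned tokens so one AR step covers both branches."""
    return list(zip(tokens_a, tokens_b))

rng = random.Random(0)
dim, codebook_size, num_positions = 8, 16, 4  # hypothetical toy sizes

def random_vectors(count, width):
    return [[rng.gauss(0, 1) for _ in range(width)] for _ in range(count)]

codebook_a = random_vectors(codebook_size, dim // 2)
codebook_b = random_vectors(codebook_size, dim // 2)
latents = random_vectors(num_positions, dim)

pairs = [product_quantize(z, codebook_a, codebook_b) for z in latents]
tokens_a = [a for a, _ in pairs]
tokens_b = [b for _, b in pairs]
folded = fold(tokens_a, tokens_b)

# Two branches double the token count, but folding keeps the sequence length
# equal to the number of spatial positions.
print(len(tokens_a) + len(tokens_b), "tokens unfolded ->", len(folded), "folded steps")
```

In this sketch the two codebooks jointly index codebook_size squared combinations per position while the autoregressive sequence stays at one step per position, which is one way to read "without increasing token length".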

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- This paper targets learning more semantically compact and disentangled image token representations, which leads to better autoregressive modeling with improved performance and shorter token length. The issue of information density (token length) and the explanation of previous image tokenizers have been hindering the performance of visual autoregressive modeling.
- Compared to previous work on image tokenizers, this work investigates product quantization to separate different information abo

Weaknesses

- Several important components in this work have been investigated in previous works, including semantic regularization, multi-scale RQ, and parallel decoding.
- This work lacks sufficient theoretical or empirical discussion of product quantization. Additional details, such as more visualization and discussion as in Fig. 8, should be provided to validate the "disentangled" nature of the obtained image tokens. (see Questions)
- This work m

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The idea of this paper is novel. Through Product Quantization, token processing is carried out in smaller spaces, and good performance is achieved under fewer tokens.
2. The logic of this paper is rigorous, the argumentation is clear, the method is fully explained, and the experimental setting is rigorous.

Weaknesses

Although this paper has reached an acceptable level, there are still several small problems that need to be worked on.
1. The description of the performance metrics is incomplete. Tables 1, 2 and some later tables do not indicate whether a higher or lower value means better performance.
2. In Formula 4, the description of several hyperparameters and their settings is missing.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The ImageFolder tokenizer achieves better generation performance by effectively folding tokens, which enhances the efficiency of the autoregressive modeling process.
2. Despite reducing token length, the proposed method does not sacrifice the quality of image reconstruction, making it a balanced solution for generative tasks.
3. By leveraging product quantization and introducing mechanisms like semantic regularization and quantizer dropout, the tokenizer captures richer semantic and pixel-leve
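The review mentions quantizer dropout without defining it. A minimal sketch follows, assuming the term has its usual residual-quantization meaning: a stack of quantizers each codes the residual left by the previous one, and during training only a random prefix of the stack is applied so that coarse prefixes remain usable on their own. Scalar rounding stands in for codebook lookup here, and the step sizes and function names are hypothetical rather than taken from the paper.

```python
import random

def quantize_scalar(x, step):
    """Round x to the nearest multiple of step (a stand-in for a codebook lookup)."""
    return round(x / step) * step

def residual_quantize(x, steps, rng=None, dropout=False):
    """Quantize x with a stack of quantizers, each coding the previous residual.

    With dropout=True, only a random prefix of the stack is used. This is the
    assumed meaning of "quantizer dropout", not a detail confirmed by the source.
    """
    rng = rng or random.Random()
    active = rng.randint(1, len(steps)) if dropout else len(steps)
    approx, residual = 0.0, x
    for step in steps[:active]:
        q = quantize_scalar(residual, step)
        approx += q
        residual -= q
    return approx, active

steps = [1.0, 0.25, 0.0625]  # hypothetical coarse-to-fine quantizer stack
x = 3.3

full, _ = residual_quantize(x, steps)         # all three quantizers
coarse, _ = residual_quantize(x, steps[:1])   # first quantizer only
dropped, active = residual_quantize(x, steps, rng=random.Random(0), dropout=True)

print("full error:", abs(full - x))
print("coarse error:", abs(coarse - x))
print("dropout used", active, "of", len(steps), "quantizers")
```

Each extra quantizer in the stack shrinks the reconstruction error, while training with random prefixes is what lets a decoder cope with fewer levels at inference.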

Weaknesses

1. The training is based on the tokenizer from LlamaGen. Is it possible to train one from scratch to better demonstrate its effectiveness?
2. How is the token length of 265 derived from the residual scales [1, 1, 2, 3, 3, 4, 5, 6, 8, 11]?
3. Are there other length design comparisons for token length, such as 265, 430, etc., to better validate the results?
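The second question above can be checked with quick arithmetic. The sketch below assumes each residual scale s corresponds to an s × s token map, as is common in multi-scale (next-scale) tokenizers; that counting rule is an assumption, since the excerpt does not state it.

```python
# Residual scales quoted in the reviewer's question.
scales = [1, 1, 2, 3, 3, 4, 5, 6, 8, 11]

# Assumption: a scale s contributes an s x s grid of tokens.
tokens_per_scale = [s * s for s in scales]
total = sum(tokens_per_scale)

print(tokens_per_scale)  # [1, 1, 4, 9, 9, 16, 25, 36, 64, 121]
print(total)             # 286
```

Under this reading the total is 286, not 265, so the figure in the question is not reproduced by summing squared scales. The excerpt does not settle whether 265 is a typo or comes from a different counting rule.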

Code & Models

Repositories

lxa9867/imagefolder
pytorch · Official

Videos

ImageFolder: Autoregressive Image Generation with Folded Tokens · slideslive

Taxonomy

Topics · Advanced Image and Video Retrieval Techniques

Methods · Diffusion