ImageFolder: Autoregressive Image Generation with Folded Tokens
Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin

TL;DR
This paper introduces ImageFolder, a semantic tokenizer that uses folded tokens and dual-branch product quantization to improve autoregressive image generation quality and efficiency without increasing token length.
Contribution
The paper presents a novel semantic tokenizer with folded tokens and dual-branch quantization, balancing reconstruction and generation quality in autoregressive models.
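As a rough illustration of the product-quantization idea behind the dual-branch design, the sketch below splits one token's feature vector into two halves and quantizes each half against its own codebook. It is a minimal sketch under stated assumptions, not the paper's implementation: the helper names (`nearest_code`, `product_quantize`), the sizes `DIM` and `K`, and the random codebooks are all hypothetical and chosen only for illustration.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


def product_quantize(vec, codebooks):
    """Split vec into len(codebooks) equal chunks and quantize each one separately."""
    chunk = len(vec) // len(codebooks)
    ids, recon = [], []
    for b, codebook in enumerate(codebooks):
        sub = vec[b * chunk:(b + 1) * chunk]
        idx = nearest_code(sub, codebook)
        ids.append(idx)              # one index per branch
        recon.extend(codebook[idx])  # quantized chunks are concatenated back
    return ids, recon


random.seed(0)
DIM, K = 8, 32  # illustrative sizes (assumption), not taken from the paper
# Two independent codebooks, each covering half of the feature dimension.
codebooks = [
    [[random.gauss(0, 1) for _ in range(DIM // 2)] for _ in range(K)]
    for _ in range(2)
]
token = [random.gauss(0, 1) for _ in range(DIM)]
ids, recon = product_quantize(token, codebooks)
print(ids, len(recon))
```

With two codebooks of `K` entries each, a single token position can index `K * K` code combinations, while each nearest-neighbour search runs over only `K` entries in a smaller subspace.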
Findings
ImageFolder achieves superior image generation quality.
It enables shorter token lengths without sacrificing performance.
The method improves efficiency in autoregressive image modeling.
Abstract
Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling. Increasing token length is a common approach to improve the image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality. There exists a trade-off between reconstruction and generation quality regarding token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to the tradeoff. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality. To enhance the representative capability without increasing token length, we leverage…
Peer Reviews
Decision: ICLR 2025 Poster
- This paper targets learning more semantically compact and disentangled image token representations, which leads to better autoregressive modeling with improved performance and shorter token length. The issue of information density (token length) and the explanation of previous image tokenizers have been hindering the performance of visual autoregressive modeling.
- Compared to previous work on image tokenizers, this work investigates product quantization to separate different information abo
- Several important components of this work, including semantic regularization, multi-scale RQ, and parallel decoding, have been explored in earlier works.
- This work lacks sufficient theoretical or empirical discussion of product quantization. Additional details, such as more visualization and discussion as in Fig. 8, should be provided to validate the "disentangled" nature of the obtained image tokens. (see Questions)
- This work m
1. The idea of this paper is novel. Through product quantization, token processing is carried out in smaller spaces, and good performance is achieved with fewer tokens.
2. The logic of this paper is rigorous, the argumentation is clear, the method is fully explained, and the experimental setting is rigorous.
Although this paper has reached an acceptable level, several small problems remain.
1. The performance metrics are not adequately described. Tables 1 and 2 and some later tables do not indicate whether higher or lower values represent better performance.
2. In Formula 4, the description of several hyperparameters and their settings is missing.
1. The ImageFolder tokenizer achieves better generation performance by effectively folding tokens, which enhances the efficiency of the autoregressive modeling process.
2. Despite reducing token length, the proposed method does not sacrifice the quality of image reconstruction, making it a balanced solution for generative tasks.
3. By leveraging product quantization and introducing mechanisms like semantic regularization and quantizer dropout, the tokenizer captures richer semantic and pixel-leve
1. The training is based on the tokenizer from LlamaGen. Is it possible to train one from scratch to better demonstrate its effectiveness?
2. How is the token length of 265 derived from the residual scales [1, 1, 2, 3, 3, 4, 5, 6, 8, 11]?
3. Are there comparisons with other token-length designs, such as 265, 430, etc., to better validate the results?
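The arithmetic behind question 2 can be checked directly. The sketch below assumes that a residual scale `s` contributes an `s × s` token map, which is a common convention for multi-scale tokenizers but is not stated on this page. Under that assumption the listed scales give 286 tokens rather than 265, which may be what prompts the question.

```python
# Residual scales quoted in the reviewer's question.
scales = [1, 1, 2, 3, 3, 4, 5, 6, 8, 11]

# Assumption: each scale s yields an s x s grid of tokens.
tokens_per_scale = [s * s for s in scales]
total = sum(tokens_per_scale)

print(tokens_per_scale)  # [1, 1, 4, 9, 9, 16, 25, 36, 64, 121]
print(total)             # 286
```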
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: Diffusion
