Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha

TL;DR
This paper introduces AliTok, an aligned tokenizer that bridges the gap between tokenization and autoregressive modeling in image generation, leading to improved fidelity and faster sampling compared to diffusion methods.
Contribution
AliTok is a novel tokenizer design that aligns token dependencies with autoregressive models, enabling high-quality image generation with fewer parameters and faster sampling speeds.
Findings
Achieved state-of-the-art gFID of 1.28 on ImageNet-256 with 662M parameters.
Surpassed diffusion models in sampling speed by 10x.
Demonstrated high fidelity and predictability in image generation.
Abstract
Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which creates a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward-dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Building upon AliTok, a standard decoder-only autoregressive model with just 177M…
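As a minimal illustration of the misalignment described in the abstract, the sketch below contrasts a bidirectional dependency pattern with the unidirectional (causal) pattern an autoregressive model assumes. It is not the AliTok implementation: the function names are hypothetical, the masks are plain Python lists, and prefix tokens and the two-stage training are not modeled.

```python
# Illustrative sketch only; names are hypothetical, not from the AliTok code.

def bidirectional_mask(n):
    """Every token may depend on every other token (conventional tokenizer)."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Token i may depend only on positions j <= i (autoregressive model)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def depends_on_future(mask):
    """Return True if any position is allowed to see a later position."""
    n = len(mask)
    return any(mask[i][j] for i in range(n) for j in range(i + 1, n))


if __name__ == "__main__":
    # Print the causal pattern: 'x' marks an allowed dependency.
    for row in causal_mask(4):
        print("".join("x" if allowed else "." for allowed in row))
```

A token sequence whose dependencies follow `bidirectional_mask` contains information about later tokens that a next-token predictor cannot use, whereas one that follows `causal_mask` matches the model's left-to-right factorization.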
Peer Reviews
Decision: ICLR 2026 Poster
1. The intuition of tailoring the tokenizer to suit AR generation is well established. An image, as a 2D grid, does not have a natural direction. Several existing methods use random-order AR. AliTok addresses this with a 1D tokenizer and a causal decoder. 2. The fix for the "first row" using prefix tokens is smart. As stated, an image has no inherent direction, which makes the first row hard to predict. Using a prefix row to encode global semantic information provides guidance for this with minimal
1. No system-level comparison of reconstruction performance. Although the final goal of a tokenizer is to enable better generation, a detailed comparison and analysis of reconstruction performance cannot be neglected. This is a major omission in a paper focusing on tokenizers. 2. Several experimental details are missing, especially for reconstruction. For example, what batch size and number of epochs are used in the two-stage training? What is the size of the tokenizer compared to other models? Without those deta
- Clear motivation and writing. The authors communicate the misalignment problem and causal-decoder idea clearly. - Comprehensive ablations. The study isolates each design component’s effect on AR accuracy and gFID. - Interpretability. Attention-map visualization shows that tokens learn a causal bias. - Empirical competence. Experiments are reproducible, metrics are standard, and … model sizes.
- **Limited conceptual novelty:** The core idea—adding a causal constraint to the tokenizer—is elegant but incremental, extending existing causal-structure ideas (e.g., masked or random-order AR) rather than offering a fundamentally new paradigm. - **Order-specific and brittle design:** The method hard-codes raster order and needs ad-hoc fixes (prefix tokens, auxiliary loss) for top-row reconstruction. It is unclear if the alignment holds under different scan orders, resolutions, or tas
1. The paper is clearly written and well-structured, making it easy to read and understand. 2. The work provides a compelling analysis of the fundamental conflict between conventional bidirectional tokenizers and unidirectional autoregressive (AR) models, and addresses it through a novel design that combines a bidirectional encoder with a causally constrained decoder to facilitate AR image generation. 3. The proposed framework is thoroughly validated through extensive experiments, ablation studi
1. The experimental evaluation is limited to ImageNet-256; results on higher-resolution datasets such as ImageNet-512 are missing, which leaves uncertainty about the scalability and robustness of the proposed method to larger image resolutions. 2. The paper does not provide a detailed analysis or explanation for the reported 10× sampling speedup. My understanding is that the method introduces additional buffer tokens and prefix tokens, which could potentially incur higher computational cost comp
Taxonomy
Topics: Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques
Methods: Diffusion
