CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang

TL;DR
CaTok introduces a causal 1D image tokenizer with a MeanFlow decoder that improves image generation and reconstruction by capturing diverse visual concepts and supporting fast, high-fidelity sampling.
Contribution
It proposes a 1D causal image tokenizer with a MeanFlow decoder, addressing the lack of causality and the token imbalance of existing visual tokenizers and diffusion autoencoders, and enabling efficient, high-quality image generation and reconstruction.
Findings
Achieves state-of-the-art ImageNet reconstruction with 0.75 FID
Supports fast one-step generation and multi-step sampling
Attains competitive performance with fewer training epochs
Abstract
Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate…
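To make the abstract's contrast between one-step generation and multi-step sampling concrete, here is a minimal sketch on a toy one-dimensional flow. It is not CaTok's decoder. The velocity field dz/dt = -z, the closed-form average velocity, and all function names are assumptions chosen so the result can be checked by hand. The sketch shows that a single jump with the average velocity over [r, t] lands exactly on the endpoint, while Euler steps with the instantaneous velocity only approach it as the step count grows.

```python
import math

def velocity(z, t):
    # Toy instantaneous velocity field (an assumption for illustration): dz/dt = -z.
    return -z

def average_velocity(z_t, r, t):
    # Closed-form average velocity over [r, t] for the toy field above.
    # Along the flow, z_r = z_t * exp(t - r), so u = (z_t - z_r) / (t - r).
    if t == r:
        return velocity(z_t, t)  # the average collapses to the instantaneous velocity
    return z_t * (1.0 - math.exp(t - r)) / (t - r)

def one_step(z_t, r, t):
    # Single jump from time t back to time r using the average velocity.
    return z_t - (t - r) * average_velocity(z_t, r, t)

def euler_multi_step(z_t, r, t, steps):
    # Integrate from t back to r with `steps` Euler updates of the instantaneous velocity.
    dt = (t - r) / steps
    z, tau = z_t, t
    for _ in range(steps):
        z = z - dt * velocity(z, tau)
        tau -= dt
    return z

if __name__ == "__main__":
    z1 = 0.5                      # state at t = 1
    exact = z1 * math.exp(1.0)    # true state at r = 0
    print("exact     :", exact)
    print("one step  :", one_step(z1, 0.0, 1.0))
    for n in (1, 4, 64):
        print(f"euler x{n:<3}:", euler_multi_step(z1, 0.0, 1.0, n))
```

In a learned model the average velocity would be predicted by a network rather than computed in closed form, so the one-step jump is only as accurate as that prediction. That is one reason a multi-step mode can still be useful.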
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- This paper identifies that previous approaches either lack causality or introduce imbalance between tokens, which is interesting and inspiring. The proposed method CaTok allows interval-based token conditioning and enables one-step generation, which are meaningful properties for research in this line.
- The paper clearly articulates the problem, solution, and experimental validation. The comparison framework (Figure 2) effectively illustrates the key differences between approaches.
- Some technical contributions, for instance employing MeanFlow for diffusion autoencoders and the use of REPA-A, might be incremental. It is reasonable to use them, but they can hardly be claimed as a major contribution.
- The performance on generation benchmarks is modest compared to the closest counterpart, Semanticist. Was that because the tokenizer was not fully trained, and how would it perform if trained for 400 epochs? Also, there is a large gap between the reconstruction performance of CaTok-L-32 and
1. Using the MeanFlow objective to model average velocity over intervals [r,t] is a clever way to enforce causal consistency while allowing one-step sampling and balanced token usage.
2. The addition of REPA-A provides both empirical and conceptual contributions, leading to faster convergence and better feature quality.
3. CaTok achieves competitive or better performance with significantly fewer training epochs than comparable models.
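For reference, the "average velocity over intervals [r,t]" mentioned in the first point can be written out as below. This follows the general MeanFlow formulation rather than anything specific to CaTok. The symbols z_t (noisy sample at time t), v (instantaneous velocity), and u (average velocity) are notation assumed here, and how CaTok ties token subsets to the interval is not spelled out on this page.

```latex
% Average velocity over the interval [r, t]
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Differentiating (t - r)\, u(z_t, r, t) with respect to t gives the MeanFlow identity
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% A single jump from time t to time r then uses the average velocity directly
z_r = z_t - (t - r)\, u(z_t, r, t)
```

Setting r = t recovers the instantaneous velocity, which is consistent with another reviewer's mention of "r=t mixing" with a Rectified Flow objective.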
1. The motivation is not well described. The paper does not clearly justify why enforcing stronger causality in visual tokenization is inherently beneficial. While the motivation draws inspiration from autoregressive language models, the authors do not provide theoretical or empirical evidence showing that causal dependencies are necessary, or even advantageous, for visual representations, which may naturally rely more on spatial coherence than temporal order.
2. The model is only evaluated on Im
The step-by-step roadmap (MeanFlow → REPA → REPA-A → interval token selection) shows measurable gains and isolates each component’s effect on rFID.
1. All experiments are on ImageNet-1K at 256×256; there is no evidence for higher resolutions, other datasets, or broader generalization.
2. Support rests mainly on ablations and qualitative trends; even the authors note that interval conditioning brings a performance drop, which undercuts the claim that causality is unequivocally beneficial.
3. The method combines MeanFlow, Rectified Flow (with adaptive L2 and r=t mixing), plus two representation-alignment losses, which makes it nontrivial to reproduce and tune.
4. The A
1. CaTok introduces a principled way to obtain causal 1D tokens from images by coupling a causal Vision Transformer encoder with a MeanFlow diffusion-based decoder, offering a practical approach to image tokenization that aligns well with next-token prediction patterns of language models.
2. The introduction of the REPA-A regularization, which aligns encoder representations with Vision Foundation Models, demonstrably accelerates and stabilizes training dynamics (Fig. 5b). Empirically, this lead
1. As shown in Table 1, the rFID of CaTok with one-step sampling is worse than that of the VQ baselines (e.g., One-D-Piece-B-256) with larger parameter size, which calls the effectiveness of this approach into question.
2. The image generation experiment in Table 2 is insufficient. It would be better to also report the performance of CaTok-B-256 and CaTok-L-256, as well as other baselines, to demonstrate the scaling potential of CaTok.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
