MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning
Jinhua Zhang, Wei Long, Minghao Han, Weiyi You, Shuhang Gu

TL;DR
MVAR introduces a novel autoregressive framework with scale and spatial Markov assumptions, significantly reducing computational complexity and memory usage in visual data modeling while maintaining or improving performance.
Contribution
The paper proposes a new Markovian autoregressive model for visual data that reduces redundancy and computational complexity through scale and spatial Markov assumptions.
Findings
Reduces GPU memory footprint by 3.0x.
Achieves comparable or superior performance on ImageNet.
Enables training with fewer GPUs and inference without a KV cache.
Abstract
Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that only takes as input the features…
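A rough way to see the scale redundancy described above is to count how many earlier-scale tokens each scale conditions on. The sketch below is illustrative only: `SCALES` is an assumed VAR-style schedule of token-map side lengths, the function names are hypothetical, and the scale-Markov variant simply assumes that each scale conditions on its adjacent previous scale alone. It is not the authors' implementation.

```python
# Assumed VAR-style schedule: side length of the token map at each scale.
# This schedule is an assumption for illustration, not taken from the text.
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def tokens_per_scale(scales):
    """Number of tokens in each square token map."""
    return [s * s for s in scales]


def context_full(scales):
    """Earlier-scale tokens visible to each scale when it conditions on all previous scales."""
    counts = tokens_per_scale(scales)
    return [sum(counts[:i]) for i in range(len(counts))]


def context_scale_markov(scales):
    """Earlier-scale tokens visible when each scale conditions only on the adjacent previous scale."""
    counts = tokens_per_scale(scales)
    return [0] + counts[:-1]


if __name__ == "__main__":
    full = context_full(SCALES)
    markov = context_scale_markov(SCALES)
    for side, f, m in zip(SCALES, full, markov):
        print(f"scale {side:>2}x{side:<2}  all previous: {f:>3}  adjacent only: {m:>3}")
```

Under these assumptions the final scale conditions on 169 earlier-scale tokens rather than 424, which gives a feel for why dropping older scales shrinks the context that has to be kept in memory.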
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper is overall well written and rather easy to follow. The figures look nice and capture the core idea at a glance.
2. The scale and spatial redundancy in the original VAR work makes intuitive sense to me, and the proposed solution is both straightforward and effective, as shown in the experiments.
3. MVAR shows performance advantages over vanilla VAR, as well as a reduced memory footprint and faster inference.
1. Only a model size of 300M is studied in this paper. It is unclear whether MVAR scales well. The authors are encouraged to show the scaling trend following VAR (up to a 2B model).
2. Beyond ImageNet unconditional generation, the authors are encouraged to try the image in-painting and out-painting tasks and the class-conditional image editing task, as in the zero-shot setup of the VAR paper.
- The motivation is clear and well supported by empirical observations of redundancy in next-scale prediction.
- The Markovian formulation is conceptually simple yet brings strong computational benefits.
- The method achieves 3× memory reduction without degrading generation quality.
- Results on ImageNet demonstrate good efficiency–performance trade-offs, making the approach practical for large-scale settings.
- The paper is overall well written and easy to follow.
- The experimental comparison is limited. The paper does not include results against related methods such as Randomized Autoregressive Visual Generation (RAVG, 2024), which also targets efficiency improvement in visual AR models.
- Table 1 reports results compared with VAR-d16, but it is unclear whether the MVAR model also uses 16 decoder layers. The discrepancy between Table 1 and Table 2 results (different FID/IS scores) suggests inconsistent model settings.
- Several experiments in the append
1. The paper accurately identifies and solves a critical bottleneck in existing VAR models related to memory and computation (especially the KV cache). This is a practical problem that has hindered scaling these models to larger sizes and higher resolutions.
2. MVAR's two core components (scale-Markov trajectory and spatial-Markov attention) are conceptually simple but highly effective. The "KV-cache-free" and "parallel training" properties drastically reduce the training cost.
1. While the scale-Markov assumption proves effective in the current setup, it might be an approximation. In more complex scenarios requiring strong long-range dependencies (e.g., images with complex global structures or multi-object interactions), discarding all information from $r_1$ to $r_{l-2}$ could become a performance bottleneck.
2. The paper mentions that training for $r_1$ to $r_8$ is parallel, but $r_9$ and $r_{10}$ (which account for 60% of the tokens) seem to be handled separately us
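The token share cited for $r_9$ and $r_{10}$ can be checked with simple arithmetic. The sketch below assumes a ten-scale VAR-style schedule of token-map side lengths (1, 2, 3, 4, 5, 6, 8, 10, 13, 16), which is not stated in the text here; `token_share` is a hypothetical helper.

```python
# Assumed VAR-style schedule of token-map side lengths (an assumption, not from the text).
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def token_share(scales, last_n):
    """Fraction of all tokens that belong to the last `last_n` scales."""
    counts = [s * s for s in scales]
    return sum(counts[-last_n:]) / sum(counts)


if __name__ == "__main__":
    # Share of tokens held by the two largest scales.
    print(f"{token_share(SCALES, 2):.1%}")
```

Under this assumed schedule, the last two scales hold 425 of 680 tokens, about 62.5%, in line with the roughly 60% the reviewer cites.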
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need
