Autoregressive Pretraining with Mamba in Vision
Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie

TL;DR
This paper demonstrates that autoregressive pretraining significantly enhances the visual capabilities of the Mamba state space model, leading to higher accuracy and better scalability for large models in vision tasks.
Contribution
It introduces autoregressive pretraining for Mamba, improving training efficiency and model performance, and enabling effective scaling to larger model sizes.
Findings
Base-size Mamba achieves 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%.
Huge-size Mamba attains 85.0% ImageNet accuracy, outperforming other variants.
Autoregressive pretraining enables faster training and better scalability.
Abstract
The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature capitalizes well on Mamba's unidirectional recurrent structure, enabling faster overall training speed compared to other training strategies like mask modeling. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy over its supervised-trained counterparts and, more importantly, successfully unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%;…
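To make the abstract's efficiency argument concrete, here is a minimal sketch of next-token prediction over a patch sequence. It is an illustration under stated assumptions, not the paper's implementation: `causal_scan` is a hypothetical toy recurrence standing in for a Mamba layer (it is not a selective state space model), and `next_token_loss` and the `decay` parameter are likewise made up for this example. The point it shows is that a unidirectional recurrence yields a prediction for every position in a single left-to-right pass.

```python
def causal_scan(tokens, decay):
    """Toy unidirectional recurrence (an assumed stand-in for a Mamba layer).

    state_t = decay * state_{t-1} + (1 - decay) * token_t
    The output at step t depends only on tokens[0..t].
    """
    state = [0.0] * len(tokens[0])
    outputs = []
    for token in tokens:
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, token)]
        outputs.append(list(state))
    return outputs


def next_token_loss(tokens, decay=0.5):
    """Mean squared error between the output at step t and the true token t+1."""
    preds = causal_scan(tokens[:-1], decay)  # one pass gives all predictions
    targets = tokens[1:]                     # targets are the inputs shifted by one
    total, count = 0.0, 0
    for pred, target in zip(preds, targets):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count


# Each "token" here is a flattened patch; three 1-dimensional patches for brevity.
patches = [[1.0], [1.0], [1.0]]
loss = next_token_loss(patches)
```

Because the scan is causal, changing a later patch never alters earlier outputs, which is what allows every position to serve as a training target in the same pass.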
Peer Reviews
Decision·ICLR 2025 Poster
- The paper is well written, including all technical details. It should be straightforward to implement.
- It is good to see that Mamba follows the same pre-training conclusion as other vision architectures (CNN, Transformer): masked prediction in pre-training followed by fine-tuning boosts performance.
- It is good to know that cluster size plays an important role in pre-training.
- Figure 3 is a bit confusing. It seems to illustrate a case of 4 clusters, each with 4 patches. I assume that the arrows within a cluster show the order in which patches are flattened, while the arrows across clusters show the order of prediction. If so, please use different colors and add an explanation in the caption.
- In Table 2, please distinguish between supervised methods and pre-training + supervised fine-tuning methods. In addition, what is the performance of MambaMLP at huge size?
- Section 4.4, is it possible t
This paper is overall well written. Investigating the Mamba architecture with autoregressive pre-training on visual data is important for the community. The results are inspiring, e.g., ARM scales somewhat well compared to its Vim counterparts, and the ablation on the prediction unit shows the effectiveness of cluster-based prediction.
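The cluster-based prediction unit discussed in these reviews can be sketched as follows. This is a hedged illustration of the case one reviewer reads from Figure 3 (4 clusters of 4 patches each), not the authors' code: the function names `cluster_order` and `prediction_pairs` are hypothetical, and the row-major order both within and across clusters is an assumption.

```python
def cluster_order(grid_h, grid_w, cluster):
    """Group a grid_h x grid_w patch grid into cluster x cluster blocks.

    Returns a list of clusters, each a list of row-major patch indices.
    Row-major traversal within and across clusters is an assumption.
    """
    if grid_h % cluster or grid_w % cluster:
        raise ValueError("grid size must be divisible by the cluster size")
    clusters = []
    for top in range(0, grid_h, cluster):
        for left in range(0, grid_w, cluster):
            clusters.append([
                (top + r) * grid_w + (left + c)
                for r in range(cluster)
                for c in range(cluster)
            ])
    return clusters


def prediction_pairs(clusters):
    """Pair each cluster with the next one, which serves as its prediction target."""
    return list(zip(clusters[:-1], clusters[1:]))


# A 4x4 patch grid with 2x2 clusters gives 4 clusters of 4 patches each.
clusters = cluster_order(4, 4, 2)
pairs = prediction_pairs(clusters)
```

With these settings the sequence has four prediction units instead of sixteen individual patches, so the model makes three next-cluster predictions per image.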
1. My main concern is the experimental setting: (1) While the results of the proposed method in Table 2 are good, the setup of the other supervised baselines (RegNetY-16G, DeiT-B, Vim-B) is not clear, e.g., for how many epochs were these models trained? Is it fair to compare against these baselines without the same number of training epochs? (2) Besides, the protocol for evaluating a self-supervised method is usually to pre-train on a large-scale dataset and then evaluate on downstream tasks (e.g., linear probe or f
The experiments clearly show the benefit of autoregressive pretraining for this task, on both MambaMLP (the proposed architecture here) and Vim. Ablations on token cluster size (a nice idea), decoder and patch representations, and sequence order describe the system behavior well. Overall, the presentation is quite clear and the system well described, though there are key points that I think could be explained better (see weaknesses below).
There are several important pieces of the method that lacked detailed explanation: * The MambaMLP block is described only in a sketch diagram (Fig 4) and a sentence in sec 3.3 ("uses Mamba as the token mixer and MLP as the channel mixer"). This is a good summary, but didn't explain the block architecture well enough for me to understand the details: Does this mean Mamba is applied independently to each channel and alternates with MLP? And how are the Delta, B and C set (which inputs and whic
Taxonomy
Topics: Spatial Cognition and Navigation
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
