GlobalMamba: Global Image Serialization for Vision Mamba
Chengkun Wang, Wenzhao Zheng, Jie Zhou, and Jiwen Lu

TL;DR
GlobalMamba introduces a novel global image serialization method using DCT to transform images into sequences that preserve 2D structural information, enhancing vision mamba models' ability to capture global context.
Contribution
The paper proposes a new global image serialization technique using DCT, enabling vision mamba models to better exploit global information and causal relations in images.
Findings
Improved image classification accuracy on ImageNet-1K
Enhanced object detection performance on COCO
Better semantic segmentation results on ADE20K
Abstract
Vision mambas have demonstrated strong performance with linear complexity in the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten the patches into 1D sequences for causal processing, which ignores the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequential processing of local patches. In this paper, we propose a global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image. We first convert the image from the spatial domain to the frequency domain using Discrete Cosine Transform (DCT) and then arrange the pixels with corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain…
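The serialization steps described in the abstract (DCT, grouping coefficients by frequency range, inverse transform per band) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the band cutoffs in `band_edges`, the max-index frequency measure, the use of disjoint bands, and the patch size are all assumptions made for illustration, since the text shown here does not specify them.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ y inverts it."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale
    return c

def frequency_band_images(image, band_edges=(0.0, 0.25, 0.5, 1.0)):
    """Split a 2D grayscale image into spatial-domain images, one per band.

    band_edges are hypothetical normalized cutoffs, not values from the paper.
    Bands are returned from lowest to highest frequency.
    """
    h, w = image.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ image @ cw.T  # 2D DCT
    # Assumed frequency measure: the larger of the normalized row/column indices.
    freq = np.maximum(np.arange(h)[:, None] / h, np.arange(w)[None, :] / w)
    bands = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        mask = (freq >= lo) & (freq < hi)
        bands.append(ch.T @ (coeffs * mask) @ cw)  # inverse 2D DCT of one band
    return bands

def serialize(image, patch=4, band_edges=(0.0, 0.25, 0.5, 1.0)):
    """Patchify each band image and concatenate tokens, low frequency first."""
    tokens = []
    for band in frequency_band_images(image, band_edges):
        h, w = band.shape
        p = band.reshape(h // patch, patch, w // patch, patch)
        tokens.append(p.transpose(0, 2, 1, 3).reshape(-1, patch * patch))
    return np.concatenate(tokens, axis=0)

rng = np.random.default_rng(0)
img = rng.random((16, 16))
seq = serialize(img, patch=4)
print(seq.shape)  # (number of bands * patches per band, pixels per patch)
```

Because the assumed bands partition the DCT coefficients, the band images sum back to the original image. Note also that the token count grows with the number of bands when no downsampling is applied, which is the kind of sequence-length overhead the reviewers raise.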
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. GlobalMamba orders image tokens by frequency, enabling the model to first capture basic structures like contours and then add finer details, mimicking the way "humans process visual information".
1. Inference time is not reported. Since the feature maps in stage 1 of GlobalMamba-T/S/B have high resolution, inference time is a concern. Can you report the inference time?
2. Limited performance gain. Compared with VMamba, GlobalMamba brings limited performance gains in classification, segmentation, and detection.
I am happy to raise my score if the authors can address my concerns.
- Overall, the paper is well-written and easy to follow.
- The motivation is clear. The proposed approach is reasonable.
- The method requires more FLOPs than V-Mamba. The performance improvement over V-Mamba is somewhat limited. On most benchmarks, the improvement over V-Mamba is only 0.2%-0.3%.
- In Table 2, the result of GlobalMamba-M (Mini) is marked in bold. However, GlobalMamba-M (Mini) seems to have lower classification accuracy than PlainMamba-L1, EffVMamba-T and EffVMamba-S. In particular, EffVMamba-T is also more efficient in Params and FLOPs. Why is EffVMamba more efficient at the mini size?
The idea of frequency-based tokenization is interesting. It enables the model to retain more global context, particularly in the low-frequency bands, aligning with the frequency principle and mimicking human visual processing.
1. Marginal Performance Improvement. The observed accuracy gains are relatively minor, often below 1%, despite the increase in model complexity and especially the increase in FLOPs.
2. Higher Computational Cost. While downsampling can mitigate some of the overhead, the sequence length and computational demand increase, raising questions about efficiency versus benefit for limited gains.
3. Higher Complexity. The frequency-based segmentation and DCT/IDCT transformations add structural c
+ The manuscript is clear and well-structured, making it easy for readers to understand its content.
+ The proposed frequency-based image serialization method is straightforward and can be readily reimplemented.
- Extracting information from the frequency domain for mamba schemes is a topic of recent research efforts [A, B]. This paper introduces an alternative method for utilizing frequency information. However, the experimental findings suggest that the performance improvements achieved through the proposed GlobalMamba are relatively minimal. Notably, the percentage increase in FLOPs surpasses the enhancement in accuracy, raising questions about the actual benefits o
S1: The method provides significantly stronger results compared to other Mamba-based SSMs, with results for both image-level and dense prediction tasks and models of various capacities.
S2: Tokenization is an important and often overlooked component in both transformers and SSMs for vision tasks, and the approach has merit as a pyramidal tokenizer for both modelling paradigms (SSMs and ViTs). This reviewer appreciates the authors' contribution to this field.
S3: The approach utili
W1: In contrast to how the method is motivated, the method does not subdivide the image into frequency bands, but instead employs a low-pass image pyramid. This has the effect of increasing the overall number of tokens in the sequence compared to other baselines, which is not mentioned in the paper. In the end, the tokenization method mimics classic Gaussian pyramids (which are not cited or referenced in the work).
W2: While the authors argue that the approach provides “causal tokens”, the gro
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: Discrete Cosine Transform · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Sparse Evolutionary Training
