GlobalMamba: Global Image Serialization for Vision Mamba

Chengkun Wang, Wenzhao Zheng, Jie Zhou, and Jiwen Lu

arXiv:2410.10316 · cs.CV · October 15, 2024 · 2 citations

TL;DR

GlobalMamba introduces a novel global image serialization method using DCT to transform images into sequences that preserve 2D structural information, enhancing vision mamba models' ability to capture global context.

Contribution

The paper proposes a new global image serialization technique using DCT, enabling vision mamba models to better exploit global information and causal relations in images.

Findings

1. Improved image classification accuracy on ImageNet-1K
2. Enhanced object detection performance on COCO
3. Better semantic segmentation results on ADE20K

Abstract

Vision Mambas have demonstrated strong performance with complexity linear in the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten the patches into 1D sequences for causal processing, which ignores the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequentially processing local patches. In this paper, we propose a global image serialization method that transforms the image into a sequence of causal tokens, each of which contains global information about the 2D image. We first convert the image from the spatial domain to the frequency domain using the Discrete Cosine Transform (DCT) and then arrange the pixels with corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain…
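The serialization steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `dct_matrix` and `frequency_serialize`, the square band shape (`max(u, v)` between two cutoffs), and the cutoff values are all assumptions chosen for clarity. It uses disjoint frequency bands; one reviewer notes that the paper's method behaves more like a low-pass pyramid, which would correspond to keeping every coefficient below each cutoff instead.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ y inverts it."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return c

def frequency_serialize(image: np.ndarray, cutoffs: list[int]) -> list[np.ndarray]:
    """Split a 2D image into spatial-domain components ordered low to high frequency.

    `cutoffs` are hypothetical band edges; the last should be >= max(image.shape)
    so that every DCT coefficient falls into some band.
    """
    h, w = image.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ image @ cw.T  # 2D DCT: spatial domain -> frequency domain

    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    radius = np.maximum(u, v)  # assumed band shape: nested squares

    bands, lo = [], 0
    for hi in cutoffs:
        mask = (radius >= lo) & (radius < hi)     # keep one frequency band
        bands.append(ch.T @ (coeffs * mask) @ cw)  # inverse DCT back to pixels
        lo = hi
    return bands

rng = np.random.default_rng(0)
image = rng.random((8, 8))
bands = frequency_serialize(image, cutoffs=[2, 4, 8])
print(len(bands), np.allclose(sum(bands), image))
```

Each returned array has the full spatial size of the input, so every element of the sequence covers the whole image rather than a local patch; in a full model each component would then be tokenized and fed to the Mamba blocks in low-to-high frequency order.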

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 5

Strengths

1. GlobalMamba orders image tokens by frequency, enabling the model to first capture basic structures like contours and then add finer details, mimicking the way "humans process visual information".

Weaknesses

1. Inference time is not reported. Since the feature maps in stage 1 of GlobalMamba-T/S/B have high resolution, inference time is a concern. Can you report the inference time?
2. Limited performance gain. Compared with VMamba, GlobalMamba brings limited performance gains in classification, segmentation, and detection.

I am happy to raise my score if the authors can address my concerns.

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- Overall, the paper is well-written and easy to follow.
- The motivation is clear. The proposed approach is reasonable.

Weaknesses

- The method requires more FLOPs than VMamba, and the performance improvement over VMamba is somewhat limited. On most benchmarks, the improvement over VMamba is only 0.2%-0.3%.
- In Table 2, the result of GlobalMamba-M (Mini) is marked in bold. However, GlobalMamba-M (Mini) seems to have lower classification accuracy than PlainMamba-L1, EffVMamba-T, and EffVMamba-S. In particular, EffVMamba-T is also more efficient in parameters and FLOPs. Why is EffVMamba more efficient at the mini size?

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The idea of frequency-based tokenization is interesting. It enables the model to retain more global context, particularly in the low-frequency bands, aligning with the frequency principle and mimicking human visual processing.

Weaknesses

1. Marginal Performance Improvement. The observed accuracy gains are relatively minor, often below 1%, despite the increase in model complexity and especially the increase of FLOPs.
2. Higher Computational Cost. While downsampling can mitigate some of the overhead, the sequence length and computational demand increase, raising questions about efficiency versus benefit for limited gains.
3. Complexity is also Higher. The frequency-based segmentation and DCT/IDCT transformations add structural c

Reviewer 04 · Rating 5 · Confidence 4

Strengths

+ The manuscript is clear and well-structured, making it easy for readers to understand its content.
+ The proposed frequency-based image serialization method is straightforward and can be readily reimplemented.

Weaknesses

- Extracting information from the frequency domain for Mamba schemes is a topic of recent research efforts [A, B]. This paper introduces an alternative method for utilizing frequency information. However, the experimental findings suggest that the performance improvements achieved through the proposed GlobalMamba are relatively minimal. Notably, the percentage increase in FLOPs surpasses the enhancement in accuracy, raising questions about the actual benefits o

Reviewer 05 · Rating 5 · Confidence 4

Strengths

S1: The method provides significantly stronger results compared to other Mamba-based SSMs, with results for both image-level and dense prediction tasks and for models of various capacities.
S2: Tokenization is an important and often overlooked component in both transformers and SSMs for vision tasks, and the approach has merit as a pyramidal tokenizer for both modelling paradigms (SSMs and ViTs). This reviewer appreciates the authors' contribution to this field.
S3: The approach utili

Weaknesses

W1: In contrast to how the method is motivated, the method does not subdivide the image into frequency bands, but instead employs a low-pass image pyramid. This has the effect of increasing the overall number of tokens in the sequence compared to other baselines, which is not mentioned in the paper. In the end, the tokenization method mimics classic Gaussian pyramids (which are not cited or referenced in the work).
W2: While the authors argue that the approach provides “causal tokens”, the gro

Code & Models

Repositories

wangck20/globalmamba · PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques

Methods: Discrete Cosine Transform · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Sparse Evolutionary Training