A Separable Self-attention Inspired by the State Space Model for Computer Vision

Juntao Zhang; Shaogeng Liu; Kun Bian; You Zhou; Pei Zhang; Jianning Liu; Jun Zhou; Bingyan Liu

arXiv:2501.02040·cs.CV·May 21, 2025

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper introduces VMINet, a novel architecture that incorporates a separable self-attention mechanism inspired by state space models, achieving competitive performance in image classification and dense prediction tasks.

Contribution

It presents a new separable self-attention method inspired by Mamba state space models and a simple prototype architecture, VMINet, for computer vision tasks.

Findings

01 Achieved competitive results on image classification.

02 Demonstrated effectiveness in high-resolution dense prediction.

03 Introduced a novel attention mechanism inspired by state space models.

Abstract

Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not suitable for handling non-causal data, Vision Mamba (ViM) methods still demonstrate good performance in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method, for the first time introducing some excellent design concepts of Mamba into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture, constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet has achieved competitive…
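For readers unfamiliar with the term, the following is a minimal sketch of a generic separable self-attention layer, in the style popularized by MobileViTv2. It is not the VMI-SA module proposed in this paper, and it omits the output projection for brevity. The function and parameter names (`separable_self_attention`, `w_i`, `w_k`, `w_v`) are illustrative assumptions. The point it shows is why the cost is linear in the number of tokens: each token gets a single scalar score, the scores pool the keys into one shared context vector, and that vector is broadcast over the values, so no n-by-n attention matrix is ever formed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(token, w):
    """Multiply a length-d token by a d x d_out matrix given as a list of rows."""
    return [sum(t * w[r][c] for r, t in enumerate(token)) for c in range(len(w[0]))]


def separable_self_attention(x, w_i, w_k, w_v):
    """Generic separable self-attention (illustrative, not the paper's VMI-SA).

    x   : list of n tokens, each a list of d floats
    w_i : list of d floats, projects each token to one scalar score
    w_k : d x d key projection (list of rows)
    w_v : d x d value projection (list of rows)
    """
    # One scalar score per token, normalized over the n tokens: O(n * d).
    scores = softmax([sum(t * w for t, w in zip(token, w_i)) for token in x])

    # Pool the projected keys into a single shared context vector: O(n * d^2).
    keys = [matvec(token, w_k) for token in x]
    d = len(keys[0])
    context = [sum(s * k[c] for s, k in zip(scores, keys)) for c in range(d)]

    # Broadcast the context over ReLU-activated values, element-wise.
    values = [[max(0.0, v) for v in matvec(token, w_v)] for token in x]
    return [[v * c for v, c in zip(val, context)] for val in values]


# Toy usage: two tokens of width 2, identity projections, uniform scores.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 2.0], [3.0, 4.0]]
out = separable_self_attention(tokens, [0.0, 0.0], identity, identity)
print(out)
```

With zero score weights the softmax is uniform, so the context vector is simply the mean of the tokens, and every output token is its value scaled element-wise by that mean.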

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

This paper analyzes the design principles of self-attention, Vision Mamba, and separable self-attention, and distills the results into four rules to guide the design of vision models.

Weaknesses

1. The use of a causal mask does not make sense for most vision tasks, since there are no causal hypotheses in the spatial dimension of images and videos. That is why Mamba models [1,2,3] in vision need to define one or several complicated scanning sequences to ensure the visual signals are correctly modeled. In VMI-SA, the authors use two sets of learnable gating parameters, $\alpha$s and $\beta$s, to control the proportion between the causal contexts and the direct contexts. It i

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. This paper introduces an interesting model that incorporates separable self-attention modules into the Mamba macro design. 2. The experiments are conducted on competitive benchmarks, e.g., ImageNet, COCO, and ADE20K. 3. The final model has linear complexity, which is a very promising research topic to explore.

Weaknesses

1. The novelty is limited. This paper incorporates the micro design of separable self-attention into the Mamba macro design, titled Mamba Inspired Separable self-Attention. It is very similar to MLLA (Mamba-Inspired Linear Attention) [1], which incorporates the Mamba micro design into the vision transformer macro design. 2. The paper also lacks a method comparison and a performance comparison with MLLA [1]. 3. Although the authors claim this is a linear model, the performance when the token length

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The evaluation is quite convincing. The comparison with ViM models shows that VMINet and VMIFormer achieve superior performance over ViM variants. Also, the ablation study of the mask in the Appendix demonstrates the importance of the mask operation.

Weaknesses

It is not clear what the authors actually adopt from SSMs in the proposed model. The explanation between Eq. 9 and Eq. 10 is not clear. Also, the efficiency analysis is too limited. EfficientVMamba shows the fewest FLOPs but longer latency, and the explanation given is "insufficient GPU utilization in EfficientVMamba’s SSM module during shorter sequence processing." Does this mean the results would be different on longer sequences? Also, the comparison does not include Flatten Transformer.

Code & Models

Repositories

yws-wxs/vminet
PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout · Linear Layer · Softmax · Adam · Residual Connection · Multi-Head Attention