A Separable Self-attention Inspired by the State Space Model for Computer Vision
Juntao Zhang, Shaogeng Liu, Kun Bian, You Zhou, Pei Zhang, Jianning Liu, Jun Zhou, Bingyan Liu

TL;DR
This paper introduces VMINet, a novel architecture that incorporates a separable self-attention mechanism inspired by state space models, achieving competitive performance in image classification and dense prediction tasks.
Contribution
It presents a new separable self-attention method inspired by Mamba state space models and a simple prototype architecture, VMINet, for computer vision tasks.
Findings
Achieved competitive results on image classification.
Demonstrated effectiveness in high-resolution dense prediction.
Introduced a novel attention mechanism inspired by state space models.
Abstract
Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not suitable for handling non-causal data, Vision Mamba (ViM) methods still demonstrate good performance in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method, for the first time introducing some excellent design concepts of Mamba into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture, constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet has achieved competitive…
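For readers unfamiliar with the term, the following is a minimal NumPy sketch of plain separable self-attention in the style popularized by MobileViTv2, which is assumed here to be the baseline the abstract builds on. It is not the paper's Mamba-inspired variant: the function name, weight names, and shapes are illustrative assumptions, and no masking or gating is included. The sketch shows why the cost is linear in the number of tokens: each token gets one scalar score, the scores pool the keys into a single context vector, and that vector is broadcast back onto the values.

```python
import numpy as np


def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def separable_self_attention(x, w_i, w_k, w_v, w_o):
    """Hypothetical helper: x is (n, d); w_i is (d, 1); w_k, w_v, w_o are (d, d)."""
    scores = softmax(x @ w_i, axis=0)                      # (n, 1): one scalar weight per token
    keys = x @ w_k                                         # (n, d)
    context = (scores * keys).sum(axis=0, keepdims=True)   # (1, d): global context vector
    values = np.maximum(x @ w_v, 0.0)                      # (n, d): ReLU-activated values
    return (values * context) @ w_o                        # (n, d): context broadcast to every token


rng = np.random.default_rng(0)
n, d = 196, 32  # e.g. a 14x14 grid of patch tokens (illustrative sizes)
x = rng.standard_normal((n, d))
w_i = rng.standard_normal((d, 1)) / np.sqrt(d)
w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out = separable_self_attention(x, w_i, w_k, w_v, w_o)
print(out.shape)
```

No n-by-n attention matrix is ever formed, so time and memory grow linearly with n. The pooled context is also independent of token order, which is the non-causal behavior that distinguishes this baseline from a scanned SSM.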
Peer Reviews
Decision·Submitted to ICLR 2026
This paper analyzes the design principles of self-attention, Vision Mamba, and separable self-attention, and distills the results into four rules to guide the design of vision models.
1. The involvement of a causal mask does not make sense for most vision tasks, since there is no causal hypothesis in the spatial dimensions of images and videos. That is why Mamba models in vision [1,2,3] need to define one or several complicated scanning sequences to ensure the visual signals are correctly modeled. In VMI-SA, the authors use two sets of learnable gating parameters, $\alpha$ and $\beta$, to control the proportion between the causal contexts and the direct contexts. It i
1. This paper introduces an interesting model, which incorporates separable self-attention modules into the Mamba macro design. 2. The experiments are conducted on competitive benchmarks, e.g., ImageNet, COCO and ADE20K. 3. The final model has linear complexity, which is a very promising research topic to explore.
1. The novelty is limited. This paper incorporates the micro design of separable self-attention into the Mamba macro design, titled Mamba Inspired Separable self-Attention. It is very similar to MLLA (Mamba-Inspired Linear Attention) [1], which incorporates the Mamba micro design into the vision transformer macro design. 2. The paper also lacks a method comparison and a performance comparison with MLLA [1]. 3. Although the authors claim this is a linear model, the performance when the token length
The evaluation is quite convincing. The comparison with ViM models shows that VMINet and VMIFormer achieve superior performance over ViM variants. Also, the mask ablation study in the Appendix demonstrates the importance of the mask operation.
It is not clear what the authors actually adopt from SSMs in the proposed model. The explanation between Eq. 9 and Eq. 10 is not clear. Also, the efficiency analysis is too limited. EfficientVMamba shows the fewest FLOPs but longer latency, and the explanation given is "insufficient GPU utilization in EfficientVMamba’s SSM module during shorter sequence processing." Does this mean the results would be different on longer sequences? Also, the comparison does not include Flatten Transformer.
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout · Linear Layer · Softmax · Adam · Residual Connection · Multi-Head Attention
