MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

Wei Hua; Chenlin Zhou; Jibin Wu; Yansong Chua; Yangyang Shu

arXiv:2505.14719·cs.CV·June 19, 2025

TL;DR

MSVIT introduces a multi-scale spiking attention mechanism to improve feature extraction in SNN-based vision transformers, achieving state-of-the-art performance across multiple datasets.

Contribution

The paper proposes MSVIT, a novel spiking transformer architecture with multi-scale attention, addressing feature extraction bottlenecks in existing SNN-transformer models.

Findings

01. MSVIT outperforms existing SNN-based models on the benchmark datasets evaluated in the paper.
02. Multi-scale spiking attention (MSSA) enhances feature extraction across image scales.
03. MSVIT achieves state-of-the-art results among SNN-transformer architectures.

Abstract

The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has garnered significant attention due to its potential for energy-efficient, high-performance computing. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. Existing methods propose spiking self-attention mechanisms that combine successfully with SNNs, but the overall architectures they propose suffer from a bottleneck in extracting features effectively from different image scales. In this paper, we address this issue and propose MSVIT. This novel spike-driven Transformer architecture is the first to use multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach across several mainstream datasets. The experimental results show that MSVIT outperforms existing…
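
The idea of attending over several scales with spike-valued tensors can be illustrated with a small sketch. This is not the MSVIT implementation, and nothing here is taken from the paper's code. It assumes a Heaviside threshold as the spiking neuron, softmax-free attention of the form (Q K^T) V times a scale factor, average pooling over groups of tokens to build coarser scales, and fusion by summation. Learned projections are omitted, and all function names and parameters (`pool_tokens`, `windows`, `scale`, `threshold`) are hypothetical.

```python
# Illustrative sketch only, NOT the MSVIT implementation.
# Assumptions: Heaviside spiking neuron, softmax-free attention,
# token average pooling for coarser scales, fusion by summation.

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def spike(m, threshold=1.0):
    """Heaviside step: emit 1 where the input reaches the threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in m]

def pool_tokens(tokens, window):
    """Average non-overlapping groups of `window` tokens, then re-spike."""
    pooled = []
    for i in range(0, len(tokens), window):
        group = tokens[i:i + window]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return spike(pooled, threshold=0.5)

def spiking_attention(q, k, v, scale):
    """Softmax-free attention on spike matrices: (Q K^T) V * scale."""
    scores = matmul(q, transpose(k))  # integer counts of overlapping spikes
    out = matmul(scores, v)
    return [[x * scale for x in row] for row in out]

def multi_scale_spiking_attention(tokens, windows=(1, 2), scale=0.5, threshold=1.0):
    """Full-resolution queries attend to keys/values at each scale;
    the per-scale outputs are summed and passed through the spiking step."""
    fused = [[0.0] * len(tokens[0]) for _ in tokens]
    for w in windows:
        kv = tokens if w == 1 else pool_tokens(tokens, w)
        out = spiking_attention(tokens, kv, kv, scale)
        fused = [[f + o for f, o in zip(frow, orow)] for frow, orow in zip(fused, out)]
    return spike(fused, threshold)

# Four tokens with four binary features each (toy input).
tokens = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
result = multi_scale_spiking_attention(tokens)
```

Because the inputs are binary, the score matrix only counts overlapping spikes, so no softmax or floating-point multiplication between activations is needed. Summation is the simplest possible fusion rule and is used here only for illustration.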

Code & Models

Repositories

nanhu-ai-lab/msvit (PyTorch, official)

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Infrared Target Detection Methodologies

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax