SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition
Liutao Yu, Liwei Huang, Chenlin Zhou, Han Zhang, Zhengyu Ma, Huihui Zhou, Yonghong Tian

TL;DR
SVFormer is a directly trained spiking transformer that performs video action recognition with low power consumption and competitive accuracy, addressing the high computational cost of ANN-based models and the limitations of earlier SNN approaches.
Contribution
The paper introduces SVFormer, a new spiking transformer architecture that combines local and global feature extraction for efficient video action recognition.
Findings
Achieves 84.03% accuracy on the UCF101 dataset.
Consumes only 21 mJ per video, demonstrating ultra-low power consumption.
Performs comparably to mainstream models at lower computational cost.
Abstract
Video action recognition (VAR) plays a crucial role in domains such as surveillance, healthcare, and industrial automation, making it highly significant for society. Consequently, it has long been a research hotspot in computer vision. As artificial neural networks (ANNs) flourish, convolutional neural networks (CNNs), including 2D-CNNs and 3D-CNNs, as well as variants of the vision transformer (ViT), have shown impressive performance on VAR. However, they usually incur a huge computational cost because of the large data volume and heavy information redundancy introduced by the temporal dimension. To address this challenge, some researchers have turned to brain-inspired spiking neural networks (SNNs), such as recurrent SNNs and ANN-converted SNNs, leveraging their inherent temporal dynamics and energy efficiency. Yet, current SNNs for VAR also encounter limitations,…
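For readers unfamiliar with the "inherent temporal dynamics" of SNNs mentioned in the abstract, the following is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron. It is illustrative only: this summary does not state which neuron model SVFormer uses, and the function name `lif_neuron` and the parameters `tau`, `v_threshold`, and `v_reset` are hypothetical choices, not taken from the paper.

```python
def lif_neuron(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one leaky integrate-and-fire neuron over a sequence of inputs.

    Assumed model (not confirmed by the paper summary): the membrane
    potential leaks toward v_reset, integrates the input, and is hard-reset
    after each spike.
    """
    v = v_reset
    spikes = []
    for x in inputs:
        # Leaky integration: move v toward (input + resting level) at rate 1/tau.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)   # emit a binary spike
            v = v_reset        # hard reset after firing
        else:
            spikes.append(0)   # stay silent this time step
    return spikes


if __name__ == "__main__":
    # A constant input of 1.5 over six time steps.
    print(lif_neuron([1.5] * 6))
```

The output is a binary spike train, and the neuron's state carries over between time steps. Sparse, binary activations of this kind are the usual basis for the energy-efficiency argument made for SNNs, since downstream layers only need to accumulate when a spike arrives.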
Taxonomy
Topics: Advanced Memory and Neural Computing · EEG and Brain-Computer Interfaces · Human Pose and Action Recognition
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Multi-Head Attention · Residual Connection · Convolution · Vision Transformer
