SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition

Liutao Yu; Liwei Huang; Chenlin Zhou; Han Zhang; Zhengyu Ma; Huihui Zhou; Yonghong Tian

arXiv:2406.15034 · cs.CV · June 24, 2024 · 2 cites

Open Access

TL;DR

SVFormer is a novel directly trained spiking transformer model that efficiently performs video action recognition with low power consumption and competitive accuracy, addressing limitations of previous neural network approaches.

Contribution

The paper introduces SVFormer, a new spiking transformer architecture that combines local and global feature extraction for efficient video action recognition.

Findings

01 Achieves 84.03% accuracy on the UCF101 dataset.

02 Consumes only 21 mJ per video, demonstrating ultra-low energy consumption.

03 Performs comparably to mainstream models at lower computational cost.

Abstract

Video action recognition (VAR) plays a crucial role in domains such as surveillance, healthcare, and industrial automation, making it highly significant for society. Consequently, it has long been a research hotspot in computer vision. As artificial neural networks (ANNs) flourish, convolutional neural networks (CNNs), including 2D-CNNs and 3D-CNNs, as well as variants of the vision transformer (ViT), have shown impressive performance on VAR. However, they usually incur a huge computational cost because of the large data volume and heavy information redundancy introduced by the temporal dimension. To address this challenge, some researchers have turned to brain-inspired spiking neural networks (SNNs), such as recurrent SNNs and ANN-converted SNNs, leveraging their inherent temporal dynamics and energy efficiency. Yet, current SNNs for VAR also encounter limitations,…
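For readers unfamiliar with the "inherent temporal dynamics" of SNNs mentioned above, the following is a minimal sketch of a single spiking neuron. It assumes a leaky integrate-and-fire (LIF) model with a hard reset, which is a common choice in directly trained SNNs; the visible part of the abstract does not state which neuron model SVFormer uses, so the function name `lif_simulate` and the parameters `tau`, `v_threshold`, and `v_reset` are illustrative assumptions, not the authors' implementation.

```python
def lif_simulate(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one leaky integrate-and-fire neuron (illustrative assumption).

    inputs: sequence of input currents, one per time step.
    Returns a list of binary spikes (0 or 1), one per time step.
    """
    v = v_reset  # membrane potential starts at the reset value
    spikes = []
    for x in inputs:
        # Leaky integration: the potential moves toward the input
        # while decaying back toward v_reset with time constant tau.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)  # threshold crossed: emit a spike
            v = v_reset       # hard reset after firing
        else:
            spikes.append(0)  # no spike at this time step
    return spikes


if __name__ == "__main__":
    # A constant input of 1.5 makes the neuron fire on every second step.
    print(lif_simulate([1.5] * 4))
```

The output is a sparse binary sequence, and the neuron's state carries over between time steps. Sparse binary activations of this kind are generally what allow spiking models to replace many multiply-accumulate operations with simple accumulations, which is the usual basis for energy-efficiency claims about SNNs.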

Taxonomy

Topics: Advanced Memory and Neural Computing · EEG and Brain-Computer Interfaces · Human Pose and Action Recognition

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Multi-Head Attention · Residual Connection · Convolution · Vision Transformer