SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity
Shihao Zou, Qingfeng Li, Wei Ji, Jingjing Li, Yongkui Yang, Guoqi Li, Chao Dong

TL;DR
SpikeVideoFormer introduces a spike-driven video Transformer with linear temporal complexity and Hamming attention, achieving state-of-the-art performance and efficiency in various video vision tasks.
Contribution
It proposes a novel spike-driven attention mechanism and a linear complexity Transformer for video tasks, advancing SNN applications in video analysis.
Findings
Achieves over 15% improvement over prior SNN methods on pose tracking and segmentation tasks.
Matches recent ANN methods' performance with significant efficiency gains.
Maintains linear temporal complexity $\mathcal{O}(T)$ in video processing.
Abstract
Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal…
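The abstract does not give the SDHA formula, so the following is only a minimal sketch of the general idea behind Hamming-based similarity on binary spike vectors, not the authors' implementation. It assumes spikes are 0/1 values and scores a query against a key by the fraction of positions where they agree. The function names `hamming_similarity`, `dot_product`, and `hamming_scores` are hypothetical and chosen for illustration.

```python
def hamming_similarity(q, k):
    """Fraction of positions where two equal-length binary spike vectors agree.

    Assumption: q and k contain only 0/1 values. This is an illustrative
    score, not the SDHA formulation from the paper.
    """
    if len(q) != len(k):
        raise ValueError("spike vectors must have the same length")
    matches = sum(1 for a, b in zip(q, k) if a == b)
    return matches / len(q)


def dot_product(q, k):
    """Plain dot product, shown for comparison with the Hamming score."""
    return sum(a * b for a, b in zip(q, k))


def hamming_scores(queries, keys):
    """Score every query against every key (rows: queries, columns: keys)."""
    return [[hamming_similarity(q, k) for k in keys] for q in queries]


q = [1, 0, 0, 0]
k_same = [1, 0, 0, 0]    # identical to q
k_dense = [1, 1, 1, 1]   # fires at every position

# The dot product cannot tell the two keys apart, the Hamming score can.
print(dot_product(q, k_same), dot_product(q, k_dense))                # 1 1
print(hamming_similarity(q, k_same), hamming_similarity(q, k_dense))  # 1.0 0.25
```

On binary vectors the dot product only counts positions where both entries are 1, so a key that fires everywhere scores as high as an exact match. A Hamming-style score also penalizes disagreement, which is one plausible motivation for replacing dot-product similarity when queries and keys are spikes.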
Taxonomy
Topics: Neural Networks and Reservoir Computing · Image Enhancement Techniques · CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Focus · Byte Pair Encoding · Softmax · Absolute Position Encodings
