BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning
Hongxiang Peng, Dewei Bai, Hong Qu

TL;DR
BSViT is a burst spiking vision transformer that combines dual-channel self-attention with local masking to improve both accuracy and efficiency on visual learning tasks.
Contribution
The paper proposes a dual-channel burst spiking self-attention mechanism, which enhances representational capacity, and patch adjacency masking, which reduces computation in spiking vision transformers.
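As a rough illustration of how a patch adjacency mask can cut down token interactions, the sketch below builds a 0/1 mask over the patches of a grid and counts how many token pairs survive. The summary does not define the neighbourhood rule, so the Chebyshev-distance criterion, the `radius` parameter, and the function name `patch_adjacency_mask` are assumptions made for illustration, not the authors' definition.

```python
def patch_adjacency_mask(height, width, radius=1):
    """Return an N x N 0/1 mask over patch tokens, with N = height * width.

    ASSUMPTION: patch j is "adjacent" to patch i when their grid positions
    differ by at most `radius` in Chebyshev distance (self included).
    The paper's actual masking rule may differ.
    """
    n = height * width
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        row_i, col_i = divmod(i, width)
        for j in range(n):
            row_j, col_j = divmod(j, width)
            if max(abs(row_i - row_j), abs(col_i - col_j)) <= radius:
                mask[i][j] = 1
    return mask


# A 3 x 3 patch grid: global attention would use all 81 token pairs.
mask = patch_adjacency_mask(3, 3)
kept = sum(sum(row) for row in mask)
print(kept, "of", len(mask) ** 2, "token pairs kept")
```

Under this assumed rule, corner patches attend to 4 tokens, edge patches to 6, and the centre patch to all 9, so the number of retained pairs grows linearly with the number of patches rather than quadratically.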
Findings
BSViT outperforms existing spiking Transformers on static and event-based benchmarks.
The model maintains energy efficiency through addition-only operations.
Incorporating burst spike coding increases spike-level representational capacity.
Abstract
Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and…
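To make the addition-only claim concrete, here is a minimal sketch of one way binary queries, burst-coded keys, and excitatory/inhibitory value channels could be combined using only additions and subtractions. The abstract does not give the exact formulation, so everything here is an assumption: the function name `dbssa_sketch`, the representation of burst spikes as small non-negative integers, the combination of the two value channels as excitatory minus inhibitory, and the absence of scaling or a spiking neuron after the attention product.

```python
def dbssa_sketch(q, k, v_exc, v_inh):
    """Illustrative addition-only attention; NOT the authors' exact method.

    q            : N x D binary query spikes (0 or 1)
    k            : N x D burst key spikes (ASSUMED: non-negative integer counts)
    v_exc, v_inh : N x D binary excitatory / inhibitory value spikes
    Returns an N x D list of signed integers.
    """
    n, d = len(q), len(q[0])

    # Because q is binary, q . k reduces to summing key counts over the
    # channels where the query fired: no multiplications are needed.
    scores = [
        [sum(k[j][c] for c in range(d) if q[i][c]) for j in range(n)]
        for i in range(n)
    ]

    # Because the value channels are binary, scores @ v reduces to summing
    # the scores of tokens whose value spike fired. The inhibitory channel
    # is subtracted, which gives a signed result (assumed combination rule).
    out = [[0] * d for _ in range(n)]
    for i in range(n):
        for c in range(d):
            excitatory = sum(scores[i][j] for j in range(n) if v_exc[j][c])
            inhibitory = sum(scores[i][j] for j in range(n) if v_inh[j][c])
            out[i][c] = excitatory - inhibitory
    return out


q = [[1, 0], [1, 1]]
k = [[2, 0], [1, 3]]          # burst counts larger than 1 carry extra information
v_exc = [[1, 0], [0, 1]]
v_inh = [[0, 1], [0, 0]]
out = dbssa_sketch(q, k, v_exc, v_inh)
print(out)  # [[2, -1], [2, 2]]
```

The negative entry in the output shows the effect of the inhibitory channel: a purely binary, single-channel value pathway could only produce non-negative sums.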
