FireFly-T: High-Throughput Sparsity Exploitation for Spiking Transformer Acceleration with Dual-Engine Overlay Architecture
Tenglong Li, Jindong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng

TL;DR
FireFly-T is a dual-engine overlay architecture that accelerates spiking transformer computation by exploiting sparsity and optimizing dataflows, reporting 1.39x and 2.40x higher energy efficiency and 4.21x and 7.10x higher DSP efficiency than prior accelerators.
Contribution
The paper introduces FireFly-T, a dual-engine overlay architecture with a sparse engine and a binary engine, enabling scalable, high-throughput acceleration of spiking transformers on FPGA hardware.
Findings
Achieves 1.39x and 2.40x higher energy efficiency compared to prior accelerators.
Attains 4.21x and 7.10x greater DSP efficiency over previous designs.
Demonstrates effective exploitation of fine-grained sparsity and scalable parallelism.
Abstract
Spiking transformers are emerging as a promising architecture that combines the energy efficiency of Spiking Neural Networks (SNNs) with the powerful attention mechanisms of transformers. However, existing hardware accelerators lack support for spiking attention, exhibit limited throughput in exploiting fine-grained sparsity, and struggle with scalable parallelism in sparse computation. To address these limitations, we propose FireFly-T, a dual-engine overlay architecture that integrates a sparse engine for activation sparsity and a binary engine for spiking attention. In the sparse engine, we propose a high-throughput sparse decoder that exploits fine-grained sparsity by concurrently extracting multiple non-zero spikes. To complement this, we introduce a scalable load balancing mechanism with weight dispatch and out-of-order execution, eliminating bank conflicts to support scalable multidimensional…
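The idea of a decoder that extracts several non-zero spikes at once can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the function names `decode_spikes` and `sparse_accumulate`, the `lanes` parameter (how many spikes are extracted per modelled cycle), and the list-based spike and weight layout are all hypothetical and chosen for illustration.

```python
from typing import Iterator, List, Tuple


def decode_spikes(spikes: List[int], lanes: int = 4) -> Iterator[List[int]]:
    """Yield, per modelled cycle, the indices of up to `lanes` non-zero spikes.

    Assumption: `lanes` stands in for the number of spikes a decoder
    could extract concurrently; the real value is not given in the text.
    """
    pending = [i for i, s in enumerate(spikes) if s]
    for start in range(0, len(pending), lanes):
        yield pending[start:start + lanes]


def sparse_accumulate(
    spikes: List[int], weights: List[int], lanes: int = 4
) -> Tuple[int, int]:
    """Accumulate weights only where a spike fired.

    Returns the accumulated sum and the number of modelled cycles,
    so zero activations cost neither arithmetic nor cycles.
    """
    acc = 0
    cycles = 0
    for group in decode_spikes(spikes, lanes):
        acc += sum(weights[i] for i in group)  # one weight per extracted spike
        cycles += 1
    return acc, cycles


spikes = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
weights = list(range(10))
total, cycles = sparse_accumulate(spikes, weights, lanes=2)
dense_total = sum(s * w for s, w in zip(spikes, weights))
```

In this model, five non-zero spikes out of ten positions take three cycles with two lanes, and the result matches the dense dot product. The point of the sketch is only that the cycle count scales with the number of non-zero spikes rather than with the vector length.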
Taxonomy
Topics: Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices
