Towards 3D Acceleration for low-power Mixture-of-Experts and Multi-Head Attention Spiking Transformers
Boxun Xu, Junyoung Hwang, Pruek Vanna-iampikul, Yuxuan Yin, Sung Kyu Lim, Peng Li

TL;DR
This paper presents a novel 3D hardware architecture for spiking transformers with mixture-of-experts and multi-head attention, significantly improving energy efficiency and latency for brain-inspired deep learning models.
Contribution
It introduces the first 3D hardware design methodology for spiking transformers, enabling highly parallel processing inspired by neural systems.
Findings
Significant energy efficiency improvements over 2D CMOS designs
Reduced latency in spiking transformer computations
Effective 3D integration with memory-on-logic and logic-on-logic stacking
Abstract
Spiking Neural Networks (SNNs) provide a brain-inspired, event-driven mechanism that is believed to be critical to unlocking energy-efficient deep learning. The mixture-of-experts approach mirrors the parallel distributed processing of nervous systems, introducing conditional computation policies and expanding model capacity without scaling up the number of computational operations. Additionally, spiking mixture-of-experts self-attention mechanisms enhance representation capacity, effectively capturing diverse patterns of entities and dependencies between visual or linguistic tokens. However, there is currently a lack of hardware support for the highly parallel distributed processing needed by spiking transformers, which embody brain-inspired computation. This paper introduces the first 3D hardware architecture and design methodology for Mixture-of-Experts and Multi-Head Attention spiking…
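The abstract's point that mixture-of-experts expands model capacity without scaling up computation can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the leaky integrate-and-fire neuron `lif`, the rate-based top-k router, the function name `spiking_moe`, and the synaptic-operation counter `sops` are all hypothetical choices made for illustration.

```python
import numpy as np

def lif(currents, threshold=1.0, decay=0.5):
    """Hypothetical leaky integrate-and-fire neuron.

    currents: (T, n) input current per timestep. Returns binary spikes (T, n).
    """
    v = np.zeros(currents.shape[1])
    spikes = np.zeros_like(currents)
    for t in range(currents.shape[0]):
        v = decay * v + currents[t]   # leaky integration of input current
        fired = v >= threshold
        spikes[t] = fired             # emit a binary spike where threshold is crossed
        v[fired] = 0.0                # hard reset of neurons that fired
    return spikes

def spiking_moe(x, gate_w, experts, k=2):
    """Hypothetical spiking MoE layer for a single token.

    x: (T, d) binary spike train; gate_w: (d, E) router weights;
    experts: list of E weight matrices, each (d, d_out).
    Returns (summed expert spikes, chosen expert ids, synaptic op count).
    """
    scores = x.mean(axis=0) @ gate_w          # assumption: route on firing rate
    chosen = np.argsort(scores)[::-1][:k]     # keep only the top-k experts
    out = np.zeros((x.shape[0], experts[0].shape[1]))
    sops = 0
    for e in chosen:
        # With binary inputs, x @ W only accumulates the rows of W where a
        # spike occurred, so each spike costs d_out additions, no multiplies.
        out += lif(x @ experts[e])
        sops += int(x.sum()) * experts[e].shape[1]
    return out, sorted(chosen.tolist()), sops
```

Because only the `k` selected experts run, `sops` depends on `k` and the input spike count, not on the total number of experts `E`. Adding experts grows capacity (parameters) while the per-token accumulate work stays fixed, which is the conditional-computation property the abstract describes.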
Taxonomy
Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Ferroelectric and Negative Capacitance Devices
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention
