Towards 3D Acceleration for low-power Mixture-of-Experts and Multi-Head Attention Spiking Transformers

Boxun Xu; Junyoung Hwang; Pruek Vanna-iampikul; Yuxuan Yin; Sung Kyu Lim; Peng Li

arXiv:2412.05540 · cs.NE · December 10, 2024

TL;DR

This paper presents a novel 3D hardware architecture for spiking transformers with mixture-of-experts and multi-head attention, significantly improving energy efficiency and latency for brain-inspired deep learning models.

Contribution

It introduces the first 3D hardware design methodology for spiking transformers, enabling highly parallel processing inspired by neural systems.

Findings

01 Significant energy efficiency improvements over 2D CMOS designs

02 Reduced latency in spiking transformer computations

03 Effective 3D integration with memory-on-logic and logic-on-logic stacking

Abstract

Spiking Neural Networks (SNNs) provide a brain-inspired, event-driven mechanism that is believed to be critical to unlocking energy-efficient deep learning. The mixture-of-experts approach mirrors the parallel distributed processing of nervous systems, introducing conditional computation policies and expanding model capacity without scaling up the number of computational operations. Additionally, spiking mixture-of-experts self-attention mechanisms enhance representation capacity, effectively capturing diverse patterns of entities and dependencies between visual or linguistic tokens. However, there is currently a lack of hardware support for the highly parallel distributed processing needed by spiking transformers, which embody brain-inspired computation. This paper introduces the first 3D hardware architecture and design methodology for Mixture-of-Experts and Multi-Head Attention spiking…
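The conditional-computation idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's architecture or routing policy: the leaky integrate-and-fire update, the top-1 router, and every name here (`lif_step`, `route_top1`, `expert_forward`, `moe_layer`, `make_experts`) are assumptions made for illustration. The sketch only shows that, with one expert selected per input and accumulation triggered only by spikes, the expert-side operation count does not grow with the number of experts. Router cost is deliberately left out of the count.

```python
import random


def lif_step(v, current, threshold=1.0, leak=0.9):
    """One leaky integrate-and-fire step (assumed neuron model).

    Returns (new_membrane_potential, spike) with a hard reset on spiking.
    """
    v = leak * v + current
    if v >= threshold:
        return 0.0, 1
    return v, 0


def route_top1(spikes, router_weights):
    """Return the index of the expert with the highest router score.

    The score of an expert is the sum of its router weights at spiking inputs.
    """
    scores = [sum(w for w, s in zip(row, spikes) if s) for row in router_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def expert_forward(spikes, weights):
    """Event-driven accumulate: only spiking inputs contribute.

    Returns (output_currents, number_of_accumulate_operations).
    """
    out = [0.0] * len(weights)
    ops = 0
    for j, row in enumerate(weights):
        for w, s in zip(row, spikes):
            if s:
                out[j] += w
                ops += 1
    return out, ops


def moe_layer(spikes, router_weights, experts):
    """Run only the selected expert; the others are skipped entirely."""
    idx = route_top1(spikes, router_weights)
    out, ops = expert_forward(spikes, experts[idx])
    return idx, out, ops


def make_experts(n_experts, n_out, n_in, seed=0):
    """Build random router and expert weights (hypothetical helper)."""
    rng = random.Random(seed)
    router = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_experts)]
    experts = [
        [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        for _ in range(n_experts)
    ]
    return router, experts


if __name__ == "__main__":
    spikes = [1, 0, 1, 0]  # two of four inputs are active
    for n_experts in (2, 8):
        router, experts = make_experts(n_experts, n_out=3, n_in=4)
        idx, out, ops = moe_layer(spikes, router, experts)
        print(f"{n_experts} experts -> expert {idx}, expert ops = {ops}")
```

In this toy setting the expert operation count is the number of active inputs times the number of outputs (2 × 3 = 6), whether 2 or 8 experts are available. A real design would also have to account for the router and for data movement between experts.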

Taxonomy

Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Ferroelectric and Negative Capacitance Devices

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention