Spark Transformer: Reactivating Sparsity in FFN and Attention

Chong You; Kan Wu; Zhipeng Jia; Lin Chen; Srinadh Bhojanapalli; Jiaxian Guo; Utku Evci; Jan Wassenberg; Praneeth Netrapalli; Jeremiah J. Willcock; Suvinay Subramanian; Felix Chern; Alek Andreev; Shreya Pathak; Felix Yu; Prateek Jain; David E. Culler; Henry M. Levy; Sanjiv Kumar

arXiv:2506.06644·cs.LG·October 24, 2025

Open Access

TL;DR

The Spark Transformer is an architecture that enforces a high level of activation sparsity in both the FFN and the attention mechanism, reducing FLOPs by 2.5x while maintaining model quality and keeping training simple.

Contribution

The paper enforces sparsity with top-k masking and introduces a hardware-friendly approximate algorithm for computing it, improving efficiency without degrading model quality.
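To make the idea of top-k masking concrete, here is a minimal sketch in plain Python. `topk_mask` keeps the k largest entries of a vector and zeroes the rest. `threshold_topk_mask` shows one way a sort-free approximation could work: it estimates a cutoff from the mean and standard deviation, assuming the entries are roughly Gaussian. Both function names and the Gaussian assumption are illustrative and are not taken from the summary above, so this should not be read as the paper's implementation.

```python
import heapq
import random
from statistics import NormalDist, fmean, pstdev


def topk_mask(values, k):
    """Exact top-k masking: keep the k largest entries, zero out the rest."""
    if k <= 0:
        return [0.0] * len(values)
    # Indices of the k largest values, found with a heap instead of a full sort.
    keep = set(heapq.nlargest(k, range(len(values)), key=values.__getitem__))
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def threshold_topk_mask(values, k):
    """Approximate top-k without sorting (hypothetical helper).

    Assumption: the entries are roughly Gaussian, so the (1 - k/d) quantile
    of a normal fitted to their mean and std serves as the cutoff.
    """
    d = len(values)
    if not 0 < k < d:
        raise ValueError("k must satisfy 0 < k < len(values)")
    mu = fmean(values)
    sigma = pstdev(values)
    cutoff = mu + sigma * NormalDist().inv_cdf(1.0 - k / d)
    # One linear pass: keep entries above the estimated cutoff.
    return [v if v > cutoff else 0.0 for v in values]


if __name__ == "__main__":
    rng = random.Random(0)
    pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    k = 800  # 8% of 10,000 entries
    exact = topk_mask(pre_activations, k)
    approx = threshold_topk_mask(pre_activations, k)
    print(sum(v != 0.0 for v in exact))   # number of entries kept by exact top-k
    print(sum(v != 0.0 for v in approx))  # near k, but not guaranteed to equal k
```

The exact version keeps a fixed number of entries, while the threshold version trades that guarantee for a single pass that needs only a mean and a standard deviation, which is the kind of trade-off a hardware-friendly approximation would typically make.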

Findings

01. Activates only 8% of FFN neurons.

02. Reduces FLOPs by 2.5x, which speeds up decoding.

03. Maintains competitive performance on benchmarks.

Abstract

The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interest in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, or complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges. This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism while maintaining model…

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices