Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers
Cody Wild, Jesper Anderson

TL;DR
This paper investigates how activation sparsity patterns in ReLU Transformers evolve during training, revealing layer-specific behaviors and the influence of training dynamics on neuron activity.
Contribution
It provides a detailed analysis of layer-dependent sparsity patterns and the mechanisms behind neuron 'turning off' during training in ReLU Transformers.
Findings
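Sparsity patterns differ from layer to layer, both in how token-level sparsity evolves during training and in how it relates to sparsity over a sequence or batch.
The first and last layers have distinctive, and in many ways inverted, relationships to sparsity.
The evidence suggests that neuron 'death' is driven primarily by training dynamics rather than occurring at random.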
Abstract
Previous work has demonstrated that MLPs within ReLU Transformers exhibit high levels of sparsity, with many of their activations equal to zero for any given token. We build on that work to more deeply explore how token-level sparsity evolves over the course of training, and how it connects to broader sparsity patterns over the course of a sequence or batch, demonstrating that the different layers within small transformers exhibit distinctly layer-specific patterns on both of these fronts. In particular, we demonstrate that the first and last layer of the network have distinctive and in many ways inverted relationships to sparsity, and explore implications for the structure of feature representations being learned at different depths of the model. We additionally explore the phenomenon of ReLU dimensions "turning off", and show evidence suggesting that "neuron death" is being primarily…
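The distinction the abstract draws between token-level sparsity and sparsity over a sequence or batch can be illustrated with a small sketch. The definitions below are assumptions made for illustration, not the paper's exact metrics: token-level sparsity is taken as the fraction of hidden units that are zero after ReLU for one token, and batch-level sparsity as the fraction of hidden units that are zero for every token in the batch. The function name `sparsity_stats` and its input layout are hypothetical.

```python
def relu(x):
    """Standard ReLU: pass positive values, clamp everything else to zero."""
    return x if x > 0.0 else 0.0


def sparsity_stats(pre_activations):
    """Compute illustrative sparsity statistics for one MLP layer.

    pre_activations: list of rows, one per token, each row holding the
    hidden-unit values before ReLU (assumed layout, not from the paper).

    Returns (token_sparsity, batch_sparsity, never_active):
      token_sparsity: per-token fraction of hidden units equal to zero
      batch_sparsity: fraction of hidden units that are zero for all tokens
      never_active:   per-unit flag, True if the unit is zero for all tokens
    """
    acts = [[relu(v) for v in row] for row in pre_activations]
    hidden_dim = len(acts[0])

    # Fraction of zero activations within each token's hidden vector.
    token_sparsity = [
        sum(1 for v in row if v == 0.0) / hidden_dim for row in acts
    ]

    # A unit counts as inactive over the batch only if no token activates it.
    never_active = [
        all(row[j] == 0.0 for row in acts) for j in range(hidden_dim)
    ]
    batch_sparsity = sum(never_active) / hidden_dim

    return token_sparsity, batch_sparsity, never_active


# Two tokens, four hidden units (made-up values).
example = [
    [1.0, -1.0, -2.0, 3.0],
    [-1.0, -1.0, 2.0, 0.5],
]
token_sparsity, batch_sparsity, never_active = sparsity_stats(example)
```

In this toy input each token zeroes half of its hidden units, yet only one unit is inactive for both tokens, so the two measures can diverge. A unit flagged in `never_active` is only a candidate for having "turned off": whether it is dead in the paper's sense depends on a criterion not stated in this summary.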
Taxonomy
Topics: Advanced Memory and Neural Computing · Semiconductor materials and devices · Advancements in Semiconductor Devices and Circuit Design
