Training-Free Activation Sparsity in Large Language Models
James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun

TL;DR
TEAL is a simple, training-free, magnitude-based activation sparsity method for large language models. It reaches 40-50% model-wide sparsity and up to 1.8x decoding speed-up with minimal performance loss across models from 7B to 70B parameters.
Contribution
It presents TEAL, a novel training-free approach to induce activation sparsity in large language models, enabling faster inference without extensive retraining.
Findings
Achieves 40-50% model-wide sparsity with minimal performance loss.
Provides up to 1.8x decoding speed-up at 50% sparsity.
Compatible with weight quantization for additional efficiency.
Abstract
Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53× and 1.8× at 40% and 50%…
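The idea in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation or their Triton kernels. The function names (`magnitude_threshold`, `sparsify`, `sparse_matvec`) are hypothetical, and the threshold here is simply read off the input vector itself, which is an assumption made for illustration; this summary does not say how TEAL chooses its thresholds. The sketch shows why zeroing low-magnitude activations can reduce memory movement: any weight column that multiplies a zeroed activation never needs to be read.

```python
def magnitude_threshold(x, sparsity):
    """Return a cutoff so that about `sparsity` of the entries of x have |v| <= cutoff.

    Assumption for illustration: the cutoff is taken from x itself.
    """
    mags = sorted(abs(v) for v in x)
    k = int(len(mags) * sparsity)  # number of entries to zero out
    if k == 0:
        return -1.0  # negative cutoff: no entry satisfies |v| <= cutoff
    return mags[k - 1]


def sparsify(x, threshold):
    """Zero every activation whose magnitude does not exceed the threshold."""
    return [v if abs(v) > threshold else 0.0 for v in x]


def dense_matvec(W, x):
    """Reference product W @ x that reads every weight."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sparse_matvec(W, x):
    """Product W @ x that only reads weight columns with nonzero activations."""
    active = [j for j, v in enumerate(x) if v != 0.0]
    return [sum(row[j] * x[j] for j in active) for row in W]


if __name__ == "__main__":
    hidden = [0.1, -2.0, 0.05, 1.5, -0.3, 0.7, -0.02, 3.0]
    W = [[1.0] * 8, [float(j) for j in range(8)]]
    sparse_hidden = sparsify(hidden, magnitude_threshold(hidden, 0.5))
    print(sparse_hidden)
    print(dense_matvec(W, sparse_hidden), sparse_matvec(W, sparse_hidden))
```

With half of the activations zeroed, `sparse_matvec` reads half of the weight columns while producing the same result as the dense product on the sparsified input. The sketch does not model the approximation error introduced by the zeroing itself, which is what the reported accuracy evaluations measure.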
Peer Reviews
Decision: ICLR 2025 Spotlight
- TEAL achieves high sparsity (40-50%) without retraining, resulting in faster execution (up to 1.8× speed-up) with minimal performance degradation.
- A range of experiments across various models and model sizes shows that TEAL consistently outperforms other methods, demonstrating its robustness and general applicability.
- The proposed method is designed to work seamlessly with newer models that use modern activation functions like SwiGLU, making it applicable to the latest LLM architectures.

- Although TEAL is adapted for compatibility with newer activation functions like SwiGLU, the core concept of magnitude-based pruning is similar to prior work, which may limit its perceived innovation and contribution.
- The paper does not mention whether the code is available for reproducibility.

* The paper successfully addresses an important and relevant applied problem: how to speed up inference of **contemporary** and widely used LLM architectures, such as the Llama model family, with no or minimal quality loss.
* The suggested solution is conceptually simple, and the authors supply custom Triton kernels, which make their method suitable for easy adoption.
* The paper is clear, legible and easy to follow. The results are solid, and the contribution is sound.
* The authors analyze and visu

1. It would be nice to include formal definitions of the various matrices (such as $W_{up}$ and $W_{down}$) and layers (MLP, SwiGLU) which are referred to throughout the work.
2. There are no comparisons with structured and semi-structured pruning (e.g. Mask-LLM [1] which was originally evaluated on the same benchmarks with Llama-2 models). Overall, it would be intriguing and instructive to compare differences in quality/speed, and trade-offs between the TEAL and unstructured/semi-structured/str

- The paper provides a high-quality exploration of sparsity patterns in SiLU-like activations, which are widely used in modern LLMs.
- The greedy search-based solution presented is straightforward yet effective.
- The authors demonstrate accuracy results on several common downstream tasks for LLMs of various scales, and the end-to-end throughput experiments show the practical value of TEAL in certain scenarios.

- While the main focus is on Llama-like LLMs, discussing the applicability of TEAL to other architectures, such as [Mixture-of-Experts](https://arxiv.org/abs/1701.06538) and [Mamba](https://arxiv.org/abs/2312.00752), would enhance the paper's scope.
- The paper lacks a detailed analysis of the memory footprint of the TEAL method, especially in long context scenarios.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
