Training-Free Activation Sparsity in Large Language Models

James Liu; Pragaash Ponnusamy; Tianle Cai; Han Guo; Yoon Kim; Ben Athiwaratkun

arXiv:2408.14690·cs.CL·February 27, 2025·3 cites

Open Access · 1 Repo · 3 Reviews

TL;DR

TEAL introduces a simple, training-free magnitude-based activation sparsity method for large language models, achieving significant speedups and efficiency improvements with minimal performance loss across various model sizes.

Contribution

It presents TEAL, a novel training-free approach to induce activation sparsity in large language models, enabling faster inference without extensive retraining.

Findings

01. Achieves 40-50% model-wide sparsity with minimal performance loss.

02. Provides up to 1.8x decoding speed-up at 50% sparsity.

03. Compatible with weight quantization for additional efficiency.

Abstract

Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53$\times$ and 1.8$\times$ at 40% and 50%…
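To make the idea in the abstract concrete, the following is a minimal NumPy sketch of magnitude-based activation sparsity, not the authors' implementation (the paper uses custom GPU kernels). It assumes a per-tensor magnitude threshold chosen as a quantile of calibration activations. The function names `calibrate_threshold`, `sparsify`, and `sparse_matvec` are hypothetical and exist only for illustration.

```python
import numpy as np

def calibrate_threshold(samples, sparsity):
    """Pick a magnitude cutoff so that roughly `sparsity` of the
    calibration entries fall at or below it (illustrative assumption)."""
    return float(np.quantile(np.abs(samples), sparsity))

def sparsify(x, threshold):
    """Zero every entry whose magnitude does not exceed the threshold."""
    return np.where(np.abs(x) > threshold, x, 0.0)

def sparse_matvec(W, x_sparse):
    """Compute W @ x while touching only the columns of W that
    correspond to nonzero activations. Skipping the other columns is
    where the savings in compute and memory movement would come from."""
    idx = np.flatnonzero(x_sparse)
    return W[:, idx] @ x_sparse[idx]

rng = np.random.default_rng(0)
calib = rng.standard_normal((1024, 64))   # stand-in calibration activations
W = rng.standard_normal((32, 64))         # stand-in weight matrix
x = rng.standard_normal(64)               # stand-in hidden state

t = calibrate_threshold(calib, 0.5)       # target about 50% sparsity
xs = sparsify(x, t)
y_sparse = sparse_matvec(W, xs)
y_dense = W @ xs                          # reference result on the same sparse input
```

In this sketch the sparse and dense products agree exactly on the sparsified input; the approximation error relative to the original model comes only from zeroing the low-magnitude entries of `x`.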

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- TEAL achieves high sparsity (40-50%) without retraining, resulting in faster execution (up to 1.8× speed-up) with minimal performance degradation.
- A range of experiments across various models and model sizes shows that TEAL consistently outperforms other methods, demonstrating its robustness and general applicability.
- The proposed method is designed to work seamlessly with newer models that use modern activation functions like SwiGLU, making it applicable to the latest LLM architectures.

Weaknesses

- Although TEAL is adapted for compatibility with newer activation functions like SwiGLU, the core concept of magnitude-based pruning is similar to prior work, which may limit its perceived innovation and contribution.
- The paper does not mention whether the code is available for reproducibility.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

* The paper successfully addresses an important and relevant applied problem: how to speed up inference of **contemporary** and widely used LLM architectures, such as the Llama model family, with little or no quality loss.
* The suggested solution is conceptually simple, and the authors supply custom Triton kernels, which makes the method easy to adopt.
* The paper is clear, legible and easy to follow. The results are solid, and the contribution is sound.
* The authors analyze and visu

Weaknesses

1. It would be nice to include formal definitions of the various matrices (such as $W_{up}$ and $W_{down}$) and layers (MLP, SwiGLU) that are referred to throughout the work.
2. There are no comparisons with structured and semi-structured pruning (e.g. Mask-LLM [1], which was originally evaluated on the same benchmarks with Llama-2 models). Overall, it would be intriguing and instructive to compare differences in quality/speed, and trade-offs between the TEAL and unstructured/ semi-structured/ str

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- The paper provides a high-quality exploration of sparsity patterns in SiLU-like activations, which are widely used in modern LLMs.
- The greedy search-based solution presented is straightforward yet effective.
- The authors report accuracy results on several common downstream tasks for LLMs at various scales, and the end-to-end throughput experiments show the practical value of TEAL in certain scenarios.

Weaknesses

- While the main focus is on Llama-like LLMs, discussing the applicability of TEAL to other architectures, such as [Mixture-of-Experts](https://arxiv.org/abs/1701.06538) and [Mamba](https://arxiv.org/abs/2312.00752), would enhance the paper's scope.
- The paper lacks a detailed analysis of the memory footprint of the TEAL method, especially in long context scenarios.

Code & Models

Repositories

fasterdecoding/teal
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques