Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Yuqi Luo; Chenyang Song; Xu Han; Yingfa Chen; Chaojun Xiao; Xiaojun Meng; Liqun Deng; Jiansheng Wei; Zhiyuan Liu; Maosong Sun

arXiv:2411.02335 · cs.LG · July 1, 2025


Open Access · 1 Repo · 10 Models · 1 Video

TL;DR

This paper investigates the factors influencing activation sparsity in large language models, proposing a new metric and revealing empirical laws that can guide the development of more efficient and interpretable LLMs.

Contribution

It introduces PPL-$p\%$ sparsity, a novel quantitative metric for activation sparsity applicable to any activation function, and provides comprehensive empirical insights into sparsity scaling laws.
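To make the notion of "activation sparsity" concrete, the following minimal sketch measures the fraction of near-zero activation outputs for ReLU and SiLU on synthetic inputs. It is not the paper's PPL-$p\%$ metric: the fixed magnitude threshold, the Gaussian pre-activations, and the helper name `sparsity_ratio` are all assumptions made for illustration. It only shows why a metric that works for any activation function needs some notion of "weakly contributing" rather than "exactly zero".

```python
import math
import random


def relu(x: float) -> float:
    return max(0.0, x)


def silu(x: float) -> float:
    # SiLU (Sigmoid Linear Unit): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def sparsity_ratio(activations, threshold):
    """Fraction of activations whose magnitude is at or below `threshold`.

    Hypothetical helper: a plain magnitude cut-off, not the paper's metric.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    inactive = sum(1 for a in activations if abs(a) <= threshold)
    return inactive / len(activations)


# Synthetic pre-activations (assumption: standard normal, fixed seed).
rng = random.Random(0)
pre_acts = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
relu_out = [relu(x) for x in pre_acts]
silu_out = [silu(x) for x in pre_acts]

relu_exact = sparsity_ratio(relu_out, 0.0)   # ReLU produces exact zeros
silu_exact = sparsity_ratio(silu_out, 0.0)   # SiLU almost never does
silu_thresh = sparsity_ratio(silu_out, 0.1)  # small outputs counted as inactive

print(f"ReLU, exact zeros:      {relu_exact:.3f}")
print(f"SiLU, exact zeros:      {silu_exact:.3f}")
print(f"SiLU, |output| <= 0.1:  {silu_thresh:.3f}")
```

Counting exact zeros reports substantial sparsity for ReLU but essentially none for SiLU, even though many SiLU outputs are small. A threshold-based count recovers that structure, and choosing the threshold by its effect on perplexity is the step this sketch deliberately leaves out.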

Findings

01

ReLU achieves higher activation sparsity than SiLU with more training data.

02

With more training data, the activation ratio follows a convergent increasing power law for SiLU and a decreasing logspace power law for ReLU.

03

At a fixed parameter scale, deeper architectures can enhance activation sparsity.
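The trend in finding 02 can be pictured with two simple curves. The sketch below is illustrative only: the functional forms (a convergent power law `a0 - c / D**alpha` and a logspace power law `exp(-c * D**alpha + b) + a0`), every coefficient, and the token counts are assumptions chosen to show the shapes, not values fitted or reported here. The activation ratio is taken to be one minus the sparsity ratio, so a falling curve means increasing sparsity, consistent with finding 01.

```python
import math


def convergent_power_law(d, a0, c, alpha):
    """A(D) = a0 - c / D**alpha: rises with D and converges to a0 from below."""
    return a0 - c / d ** alpha


def logspace_power_law(d, a0, b, c, alpha):
    """A(D) = exp(-c * D**alpha + b) + a0: falls with D and converges to a0 from above."""
    return math.exp(-c * d ** alpha + b) + a0


# Hypothetical training-data sizes (in tokens) and hypothetical coefficients.
token_counts = [1e9, 1e10, 1e11, 1e12]
silu_like = [convergent_power_law(d, a0=0.40, c=5.0, alpha=0.2) for d in token_counts]
relu_like = [logspace_power_law(d, a0=0.07, b=0.0, c=0.02, alpha=0.2) for d in token_counts]

for d, s, r in zip(token_counts, silu_like, relu_like):
    print(f"D={d:.0e}  SiLU-like ratio={s:.3f}  ReLU-like ratio={r:.3f}")
```

Under these assumed shapes, the SiLU-like activation ratio climbs toward its limit while the ReLU-like ratio keeps shrinking, which is one way a ReLU model could end up sparser as training data grows.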

Abstract

Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit…

Code & Models

Repositories

thunlp/SparsingLaw
PyTorch · Official

Models

Videos

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sigmoid Linear Unit