Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Xiaojun Meng, Liqun Deng, Jiansheng Wei, Zhiyuan Liu, Maosong Sun

TL;DR
This paper investigates the factors influencing activation sparsity in large language models, proposing a new metric and revealing empirical laws that can guide the development of more efficient and interpretable LLMs.
Contribution
It introduces PPL-$p\%$ sparsity, a novel quantitative metric for activation sparsity applicable to any activation function, and provides comprehensive empirical insights into sparsity scaling laws.
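To make the idea of a performance-aware sparsity metric concrete, here is a minimal sketch. It assumes the metric can be read as "the sparsity reached when perplexity is allowed to rise by at most $p\%$ over the dense model"; that reading, the single global magnitude threshold, the toy one-layer model, and every function name below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def perplexity(acts, w_out, labels):
    """Perplexity of a toy linear output head applied to the activations."""
    logits = acts @ w_out
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(np.exp(-log_probs[np.arange(len(labels)), labels].mean()))

def prune(acts, threshold):
    """Zero out activations whose magnitude does not exceed the threshold."""
    return np.where(np.abs(acts) > threshold, acts, 0.0)

def sparsity_under_ppl_budget(acts, ppl_fn, p, iters=40):
    """Bisect on a magnitude threshold while keeping perplexity within
    p% of the dense value. `lo` always satisfies the budget; it is the
    largest such threshold only if perplexity grows monotonically with it."""
    budget = ppl_fn(acts) * (1.0 + p / 100.0)
    lo, hi = 0.0, float(np.abs(acts).max())
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ppl_fn(prune(acts, mid)) <= budget:
            lo = mid
        else:
            hi = mid
    return float(np.mean(prune(acts, lo) == 0.0)), lo

# Toy data (hypothetical): random pre-activations and a random output head.
rng = np.random.default_rng(0)
acts = silu(rng.standard_normal((256, 64)))
w_out = rng.standard_normal((64, 32))
labels = (acts @ w_out).argmax(axis=1)
ppl_fn = lambda a: perplexity(a, w_out, labels)

sparsity, threshold = sparsity_under_ppl_budget(acts, ppl_fn, p=1.0)
print(f"sparsity at a 1% perplexity budget: {sparsity:.3f} (threshold {threshold:.4f})")
```

Because the threshold acts on output magnitudes rather than exact zeros, the same procedure works for SiLU, which rarely outputs exact zeros, as well as for ReLU.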
Findings
ReLU-activated models reach higher activation sparsity than SiLU-activated ones as the amount of training data increases.
The activation ratio (1 − sparsity ratio) follows a convergent increasing power law with training data for SiLU and a decreasing logspace power law for ReLU.
Deeper architectures at fixed parameters can enhance activation sparsity.
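The last finding compares model shapes at a fixed parameter count. The sketch below only illustrates what that trade-off looks like numerically. It assumes the common approximation of about 12 · depth · width² non-embedding parameters for a standard Transformer (attention plus a 4× feed-forward layer); the models studied in the paper may be parameterized differently, and the function name and the 100M budget are hypothetical.

```python
import math

def shapes_at_fixed_budget(budget, depths, multiple=64):
    """For each depth, pick the hidden width (rounded to `multiple`) that
    keeps 12 * depth * width**2 close to the parameter budget."""
    rows = []
    for depth in depths:
        width = math.sqrt(budget / (12 * depth))
        width = max(multiple, multiple * round(width / multiple))
        params = 12 * depth * width ** 2
        rows.append((depth, width, params, width / depth))
    return rows

rows = shapes_at_fixed_budget(100_000_000, depths=[6, 12, 24])
for depth, width, params, ratio in rows:
    print(f"depth={depth:2d} width={width:4d} "
          f"params~{params / 1e6:.1f}M width/depth={ratio:.1f}")
```

Under this approximation, doubling the depth at a fixed budget shrinks the width by roughly √2, so the width-to-depth ratio falls quickly as the model gets deeper.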
Abstract
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit…
Code & Models
- 🤗 SparseLLM/sparsing-law-0.1b-relu (model, 4 downloads, 2 likes)
- 🤗 SparseLLM/sparsing-law-0.1b-silu (model, 7 downloads)
- 🤗 SparseLLM/sparsing-law-0.2b-relu (model, 5 downloads)
- 🤗 SparseLLM/sparsing-law-0.8b-relu (model, 2 downloads)
- 🤗 SparseLLM/sparsing-law-0.4b-relu (model)
- 🤗 SparseLLM/sparsing-law-1.2b-relu (model, 1 download)
- 🤗 SparseLLM/sparsing-law-0.2b-silu (model, 1 download)
- 🤗 SparseLLM/sparsing-law-0.4b-silu (model, 1 download)
- 🤗 SparseLLM/sparsing-law-0.8b-silu (model, 1 download)
- 🤗 SparseLLM/sparsing-law-1.2b-silu (model, 1 download)
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sigmoid Linear Unit
