Activation Sparsity Opportunities for Compressing General Large Language Models
Nobel Dhar, Bobin Deng, Md Romyull Islam, Kazi Fahim Ahmad Nasif, Liang Zhao, Kun Suo

TL;DR
This paper explores activation sparsity as a way to compress large language models for deployment on edge devices, reducing the memory and computation of the Feed-Forward Network (FFN) components by around 50% with negligible accuracy loss.
Contribution
It systematically investigates activation sparsity in LLMs, providing a practical guideline for system optimization and demonstrating effective compression of FFN components.
Findings
Achieves ~50% memory and computation reduction in FFN components.
Negligible accuracy degradation with increased activation sparsity.
Provides a system prediction guideline for efficient LLM deployment.
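To make the idea behind these findings concrete, the following is a minimal sketch of how activation sparsity can cut FFN work. It is an illustration under stated assumptions, not the authors' method: it assumes a two-layer FFN with a ReLU activation and a simple magnitude threshold, and the names `sparse_ffn`, `W_up`, `W_down_cols`, and `threshold` are hypothetical. Hidden units whose activation does not exceed the threshold are skipped, so their slice of the down-projection weights is never read or multiplied.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sparse_ffn(W_up, W_down_cols, x, threshold=0.0):
    """Compute y = W_down @ relu(W_up @ x), skipping inactive hidden units.

    W_down_cols[j] holds the down-projection weights attached to hidden
    unit j (one column of W_down). Hidden units with activation <= threshold
    are treated as zero (assumed sparsification rule, for illustration only).
    Returns the output vector and the fraction of hidden units skipped.
    """
    h = relu(matvec(W_up, x))
    active = [j for j, hj in enumerate(h) if hj > threshold]
    d_out = len(W_down_cols[0])
    y = [0.0] * d_out
    for j in active:  # only active units touch W_down
        col = W_down_cols[j]
        for i in range(d_out):
            y[i] += h[j] * col[i]
    sparsity = 1.0 - len(active) / len(h)
    return y, sparsity


# Toy demo with random weights (sizes are arbitrary assumptions).
random.seed(0)
d_model, d_ff = 8, 32
W_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
W_down_cols = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
x = [random.gauss(0, 1) for _ in range(d_model)]

y_dense, _ = sparse_ffn(W_up, W_down_cols, x, threshold=float("-inf"))
y_exact, s_exact = sparse_ffn(W_up, W_down_cols, x, threshold=0.0)
y_thr, s_thr = sparse_ffn(W_up, W_down_cols, x, threshold=0.5)

print("sparsity at threshold 0.0:", s_exact)
print("sparsity at threshold 0.5:", s_thr)
print("max abs error at threshold 0.5:",
      max(abs(a - b) for a, b in zip(y_dense, y_thr)))
```

With a threshold of 0.0 the skipped units contribute exactly zero, so the output matches the dense computation; a higher threshold skips more units at the cost of some approximation error. This sketch saves work only in the down projection. Saving work in the up projection as well would require knowing in advance which hidden units will be active, which this example does not attempt.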
Abstract
Deploying local AI models, such as Large Language Models (LLMs), to edge devices can substantially enhance devices' independent capabilities, alleviate the server's burden, and lower the response time. Owing to this tremendous potential, many big tech companies have released several lightweight Small Language Models (SLMs) to bridge this gap. However, there is still strong motivation to deploy more powerful AI models (LLMs) on edge devices and enhance their smartness level. Unlike the conventional approaches for AI model compression, we investigate activation sparsity. The activation sparsity method is orthogonal to and combinable with existing techniques to maximize the compression rate while maintaining great accuracy. LLMs' Feed-Forward Network (FFN) components, which typically comprise a large proportion of parameters (around 2/3), ensure that our FFN optimizations would have a better…
Taxonomy
Topics: Topic Modeling
