SLIM: A Heterogeneous Accelerator for Edge Inference of Sparse Large Language Model via Adaptive Thresholding
Weihong Xu, Haein Choi, Po-kai Hsu, Shimeng Yu, and Tajana Rosing

TL;DR
SLIM is a co-designed algorithm-hardware system that exploits sparsity in large language models to enable efficient, low-energy inference on edge devices by reducing data movement and leveraging heterogeneous processing architectures.
Contribution
SLIM introduces an adaptive thresholding algorithm and a heterogeneous hardware architecture that together enable efficient sparse LLM inference on resource-constrained edge devices.
Findings
Achieves 13-18x throughput improvement over SSD-GPU systems.
Attains 9-10x better energy efficiency than DRAM-GPU systems.
Maintains low latency with negligible accuracy loss.
Abstract
Large language models (LLMs) have demonstrated exceptional proficiency in understanding and generating human language, but efficient inference on resource-constrained embedded devices remains challenging due to large model sizes and memory-intensive operations in feedforward network (FFN) and multi-head attention (MHA) layers. While existing accelerators offload LLM inference to expensive heterogeneous computing systems, they fail to exploit the significant sparsity inherent in LLM operations, leaving hardware resources underutilized. We propose SLIM, an algorithm-hardware co-design optimized for sparse LLM serving on edge devices. SLIM exploits LLM sparsity through an adaptive thresholding algorithm that enables runtime-configurable sparsity with negligible accuracy loss, fetching only activated neurons to dramatically reduce data movement. Our heterogeneous hardware architecture…
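The abstract's idea of fetching only activated neurons can be illustrated with a small sketch. This is not SLIM's actual algorithm, which the text above does not specify. It assumes a ReLU feedforward layer and uses a simple magnitude-percentile threshold as a stand-in for adaptive thresholding. The names `pick_threshold`, `sparse_ffn`, and `target_sparsity` are hypothetical. For simplicity the sketch computes the full up-projection before selecting neurons; a real system would presumably need to identify active neurons more cheaply.

```python
def relu(value):
    return value if value > 0.0 else 0.0


def pick_threshold(activations, target_sparsity):
    """Return a magnitude threshold so that roughly `target_sparsity`
    of the activations fall below it (assumed stand-in heuristic)."""
    magnitudes = sorted(abs(a) for a in activations)
    # Clamp so that at least the largest activation is always kept.
    k = min(int(target_sparsity * len(magnitudes)), len(magnitudes) - 1)
    return magnitudes[k]


def sparse_ffn(x, w_up, w_down, target_sparsity):
    """Toy FFN: hidden = relu(w_up @ x), out = sum over active neurons
    of hidden[j] * w_down[j]. Only rows of w_down that belong to active
    neurons are read, which models the reduced data movement."""
    hidden = [relu(sum(xi * wi for xi, wi in zip(x, row))) for row in w_up]
    tau = pick_threshold(hidden, target_sparsity)
    active = [j for j, h in enumerate(hidden) if h >= tau and h > 0.0]

    out = [0.0] * len(w_down[0])
    for j in active:  # rows for inactive neurons are never touched
        for d, weight in enumerate(w_down[j]):
            out[d] += hidden[j] * weight
    return out, active


# Tiny example: 2 inputs, 4 hidden neurons, 2 outputs.
x = [1.0, 2.0]
w_up = [[0.1, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

dense_out, dense_active = sparse_ffn(x, w_up, w_down, target_sparsity=0.0)
sparse_out, sparse_active = sparse_ffn(x, w_up, w_down, target_sparsity=0.5)
print(dense_active, dense_out)
print(sparse_active, sparse_out)
```

Raising `target_sparsity` at runtime shrinks the set of rows read from `w_down`, at the cost of dropping small activations. In this toy case, half of the down-projection rows are skipped and the output changes only slightly.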
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis
