SLIM: A Heterogeneous Accelerator for Edge Inference of Sparse Large Language Model via Adaptive Thresholding

Weihong Xu, Haein Choi, Po-kai Hsu, Shimeng Yu, and Tajana Rosing

arXiv:2507.09201 · cs.AR · July 15, 2025

TL;DR

SLIM is a co-designed algorithm-hardware system that exploits sparsity in large language models to enable efficient, low-energy inference on edge devices by reducing data movement and leveraging heterogeneous processing architectures.

Contribution

SLIM introduces an adaptive thresholding algorithm and a heterogeneous hardware architecture that together enable efficient sparse LLM inference on resource-constrained edge devices.

Findings

01. Achieves 13-18x throughput improvement over SSD-GPU systems.

02. Attains 9-10x better energy efficiency over DRAM-GPU systems.

03. Maintains low latency with negligible accuracy loss.

Abstract

Large language models (LLMs) have demonstrated exceptional proficiency in understanding and generating human language, but efficient inference on resource-constrained embedded devices remains challenging due to large model sizes and memory-intensive operations in feedforward network (FFN) and multi-head attention (MHA) layers. While existing accelerators offload LLM inference to expensive heterogeneous computing systems, they fail to exploit the significant sparsity inherent in LLM operations, leaving hardware resources underutilized. We propose SLIM, an algorithm-hardware co-design optimized for sparse LLM serving on edge devices. SLIM exploits LLM sparsity through an adaptive thresholding algorithm that enables runtime-configurable sparsity with negligible accuracy loss, fetching only activated neurons to dramatically reduce data movement. Our heterogeneous hardware architecture…
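The abstract does not spell out the thresholding rule, so the following is only a minimal sketch of the general idea: pick a magnitude threshold at runtime for a requested sparsity level, then read only the weight rows of neurons that pass it. It assumes a ReLU-style FFN and a percentile-based threshold; the function names (`pick_threshold`, `sparse_ffn`) and the `target_sparsity` parameter are hypothetical and are not taken from the paper. For simplicity the up-projection is computed densely here, and only the down-projection skips inactive neurons.

```python
from typing import List, Tuple


def relu(v: float) -> float:
    return v if v > 0.0 else 0.0


def pick_threshold(hidden: List[float], target_sparsity: float) -> float:
    """Return a magnitude cutoff that drops at least `target_sparsity` of the neurons.

    Assumption: a simple percentile rule stands in for the paper's adaptive scheme.
    """
    mags = sorted(abs(h) for h in hidden)
    k = int(target_sparsity * len(mags))  # number of neurons to drop
    if k <= 0:
        return -1.0  # every magnitude is >= 0, so nothing is dropped
    return mags[k - 1]


def sparse_ffn(
    x: List[float],
    w_up: List[List[float]],    # one row per hidden neuron, length len(x)
    w_down: List[List[float]],  # one row per hidden neuron, length d_out
    target_sparsity: float,
) -> Tuple[List[float], List[int]]:
    # Up-projection followed by ReLU (computed densely in this sketch).
    hidden = [relu(sum(xi * wi for xi, wi in zip(x, row))) for row in w_up]

    # Runtime-configurable sparsity: keep neurons strictly above the cutoff.
    thr = pick_threshold(hidden, target_sparsity)
    active = [j for j, h in enumerate(hidden) if abs(h) > thr]

    # Only rows of w_down that belong to active neurons are ever read,
    # which is where the reduction in data movement would come from.
    out = [0.0] * len(w_down[0])
    for j in active:
        for o, w in enumerate(w_down[j]):
            out[o] += hidden[j] * w
    return out, active


# Toy example with 2 inputs, 4 hidden neurons, 2 outputs.
x = [1.0, 2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]

sparse_out, active_ids = sparse_ffn(x, w_up, w_down, target_sparsity=0.5)
dense_out, all_ids = sparse_ffn(x, w_up, w_down, target_sparsity=0.0)
```

With `target_sparsity=0.5` the sketch reads two of the four `w_down` rows, and the output differs from the dense result only by the contribution of the small-magnitude neuron that was dropped. Ties at the cutoff are dropped together, so the achieved sparsity can exceed the requested one.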

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis