DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
Jiawen Qi, Chang Gao, Zhaochun Ren, Qinyu Chen

TL;DR
DeltaLLM is a training-free framework that leverages temporal sparsity in attention patterns to enable efficient large language model inference on resource-constrained edge devices, improving speed and reducing computation without retraining.
Contribution
It introduces a novel delta matrix construction and a hybrid attention mechanism tailored for edge devices, enabling significant sparsity and efficiency gains during LLM inference.
Findings
Achieves up to 60% attention sparsity with negligible accuracy loss.
Improves inference efficiency on edge devices without additional training.
Demonstrates effectiveness across multiple language models and tasks.
Abstract
Deploying Large Language Models (LLMs) on edge devices remains challenging because their computation grows quadratically with sequence length. Existing dynamic attention pruning methods are designed for hardware with massively parallel computation capabilities, such as GPUs or TPUs, and target long context lengths (e.g., 64K), making them unsuitable for edge scenarios. We present DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference on resource-constrained edge devices across both the prefilling and decoding stages. DeltaLLM introduces an accuracy- and memory-aware delta matrix construction strategy that creates temporal sparsity, and a context-aware hybrid attention mechanism that combines full attention in a local context window with delta approximation outside it to increase accuracy. We evaluate…
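The two ideas named in the abstract, a thresholded delta representation and full attention restricted to a local window, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names, the choice to apply the delta approximation to the key matrix, the fixed `threshold`, and the `window` parameter are all assumptions made for illustration only.

```python
import math

def delta_sparsify(rows, threshold):
    """Approximate each row by a running reference that is updated only where
    the element-wise change exceeds `threshold` (an assumed hyperparameter).
    Returns the approximated rows and the fraction of deltas treated as zero."""
    ref = list(rows[0])              # the first row is stored exactly
    approx = [list(ref)]
    zeros = total = 0
    for row in rows[1:]:
        for j, x in enumerate(row):
            total += 1
            if abs(x - ref[j]) > threshold:
                ref[j] = x           # significant change: keep this delta
            else:
                zeros += 1           # small change: delta treated as zero
        approx.append(list(ref))
    sparsity = zeros / total if total else 0.0
    return approx, sparsity

def attend(query, keys, values):
    """Plain scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(dim)]

def hybrid_attention(query, keys, values, window, threshold):
    """Attention for the most recent position: exact keys for the last
    `window` tokens, delta-approximated keys for everything earlier."""
    approx_keys, sparsity = delta_sparsify(keys, threshold)
    split = max(0, len(keys) - window)
    mixed = approx_keys[:split] + [list(k) for k in keys[split:]]
    return attend(query, mixed, values), sparsity

# Toy usage with hypothetical 2-dimensional keys and 1-dimensional values.
keys = [[1.0, 0.0], [1.05, 0.0], [0.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
output, sparsity = hybrid_attention([1.0, 1.0], keys, values,
                                    window=1, threshold=0.1)
```

In this sketch a larger `threshold` zeroes more deltas and raises sparsity at the cost of approximation error, while a larger `window` keeps more of the recent context exact. How DeltaLLM actually balances accuracy and memory when building its delta matrix is not specified in the text above.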
