EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting
Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reddy Bommu, Yang Katie Zhao, Yingyan Celine Lin

TL;DR
Edge-LLM is a novel framework that enables efficient adaptation of large language models on edge devices by combining layer-wise compression, adaptive layer tuning, and hardware-aware scheduling, significantly reducing computation and memory overheads.
Contribution
The paper introduces Edge-LLM, a framework that integrates layer-wise compression, adaptive layer tuning, and hardware scheduling for efficient LLM adaptation on edge devices, addressing the high computation and memory overheads of existing tuning techniques.
Findings
Achieves 2.92x speedup over vanilla tuning methods.
Reduces memory overhead by 4x while maintaining accuracy.
Demonstrates effectiveness through extensive experiments.
Abstract
Efficient adaptation of large language models (LLMs) on edge devices is essential for applications requiring continuous and privacy-preserving adaptation and inference. However, existing tuning techniques fall short because of their high computation and memory overheads. To this end, we introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices. Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique to reduce the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme to reduce the memory overhead by reducing the backpropagation depth, and (3) a complementary hardware scheduling strategy to handle the irregular computation patterns introduced by LUC and…
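To make components (1) and (2) more concrete, the following is a minimal sketch and not the authors' implementation. It assumes magnitude-based pruning, symmetric uniform quantization, and probability averaging as the voting rule; the abstract does not specify any of these choices. All function names (`prune_by_magnitude`, `quantize_uniform`, `compress_layers`, `vote_across_exits`) and the example policy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def prune_by_magnitude(weights: Sequence[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Assumption: magnitude pruning stands in for whatever criterion LUC uses.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by |w| ascending; the first n_prune of them are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def quantize_uniform(weights: Sequence[float], bits: int) -> List[float]:
    """Symmetric uniform fake-quantization: snap each weight to a `bits`-bit grid.

    Values are returned as floats so the rounding error is easy to inspect.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1      # e.g. 7 positive levels for 4 bits
    scale = max_abs / levels          # width of one quantization step
    return [round(w / scale) * scale for w in weights]


def compress_layers(
    layers: Sequence[Sequence[float]],
    policy: Sequence[Tuple[float, int]],
) -> List[List[float]]:
    """Apply a per-layer (sparsity, bit-width) policy: prune first, then quantize."""
    return [
        quantize_uniform(prune_by_magnitude(w, sparsity), bits)
        for w, (sparsity, bits) in zip(layers, policy)
    ]


def vote_across_exits(exit_probs: Sequence[Sequence[float]]) -> int:
    """Combine class probabilities from several exit layers and pick one class.

    Assumption: a plain average is used here; the paper's voting rule may differ.
    """
    n_classes = len(exit_probs[0])
    avg = [sum(p[c] for p in exit_probs) / len(exit_probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])


if __name__ == "__main__":
    # Two toy "layers" with different (sparsity, bit-width) settings.
    layers = [[0.1, -0.5, 0.9, -0.05], [1.0, -2.0, 3.0, -4.0]]
    policy = [(0.5, 4), (0.25, 8)]    # hypothetical policy values
    print(compress_layers(layers, policy))
    print(vote_across_exits([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]))
```

The point of the per-layer policy is that each layer can receive its own sparsity and bit-width rather than one global setting, which is also what produces the irregular computation patterns that component (3) is meant to handle.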
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning
