EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

Zhongzhi Yu; Zheng Wang; Yuhan Li; Haoran You; Ruijie Gao; Xiaoya Zhou; Sreenidhi Reedy Bommu; Yang Katie Zhao; Yingyan Celine Lin

arXiv:2406.15758·cs.LG·June 25, 2024·1 cites

TL;DR

Edge-LLM is a novel framework that enables efficient adaptation of large language models on edge devices by combining layer-wise compression, adaptive layer tuning, and hardware-aware scheduling, significantly reducing computation and memory overheads.

Contribution

The paper introduces Edge-LLM, a framework that integrates layer-wise compression, adaptive layer tuning and voting, and hardware scheduling for efficient LLM adaptation on edge devices, addressing the high computation and memory overheads of existing tuning techniques.
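
To make the adaptive tuning component more concrete, here is a minimal Python sketch of the idea the abstract describes as reducing backpropagation depth: each training step updates only a short window of consecutive layers ending at a sampled exit layer, and predictions from several exits are combined by voting at inference. The function names (`pick_tuning_window`, `vote`), the uniform sampling of the exit layer, and the confidence-weighted voting rule are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
import random
from collections import defaultdict
from typing import List, Sequence


def pick_tuning_window(num_layers: int, depth: int, rng: random.Random) -> List[int]:
    """Sample an exit layer and return the `depth` consecutive layers ending at it.

    Only these layers would receive gradients in the current step, so the
    backpropagation depth is bounded by `depth` instead of `num_layers`.
    (Uniform sampling is an assumption of this sketch.)
    """
    exit_layer = rng.randrange(depth - 1, num_layers)
    start = exit_layer - depth + 1
    return list(range(start, exit_layer + 1))


def vote(exit_probs: Sequence[Sequence[float]]) -> int:
    """Combine class probabilities from several exit layers.

    Each exit votes for its top class, weighted by its confidence in that
    class (a hypothetical voting rule chosen for simplicity).
    """
    scores = defaultdict(float)
    for probs in exit_probs:
        label = max(range(len(probs)), key=probs.__getitem__)
        scores[label] += probs[label]
    return max(scores, key=scores.get)


# Example: a 12-layer model tuned 3 layers at a time.
window = pick_tuning_window(num_layers=12, depth=3, rng=random.Random(0))
# Example: three exits voting over two classes.
prediction = vote([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
```

Bounding the window length is what limits activation storage during the backward pass in this sketch; the voting step is there so that exits trained at different depths can still contribute to one final prediction.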

Findings

01. Achieves 2.92x speedup over vanilla tuning methods.

02. Reduces memory overhead by 4x while maintaining accuracy.

03. Demonstrates effectiveness through extensive experiments.

Abstract

Efficient adaptation of large language models (LLMs) on edge devices is essential for applications requiring continuous and privacy-preserving adaptation and inference. However, existing tuning techniques fall short because of their high computation and memory overheads. To this end, we introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices. Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique to reduce the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme to reduce the memory overhead by reducing the backpropagation depth, and (3) a complementary hardware scheduling strategy to handle the irregular computation patterns introduced by LUC and…
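
As an illustration of what a layer-wise pruning sparsity and quantization bit-width policy might look like, the following Python sketch assigns each layer a (bits, sparsity) pair from a per-layer sensitivity score and then applies magnitude pruning followed by symmetric uniform quantization. The sensitivity scores, the bucket-based assignment rule, the candidate bit-widths and sparsities, and all function names are assumptions of this sketch; the abstract does not specify how Edge-LLM generates its policies.

```python
from typing import Dict, List, Sequence


def assign_policy(sensitivity: Sequence[float],
                  bit_choices: Sequence[int] = (2, 4, 8),
                  sparsity_choices: Sequence[float] = (0.5, 0.25, 0.0)) -> List[Dict]:
    """Give less sensitive layers fewer bits and higher sparsity.

    Layers are ranked by sensitivity and split into equal buckets, one per
    candidate setting (a hypothetical rule, not the paper's search method).
    """
    n = len(sensitivity)
    order = sorted(range(n), key=lambda i: sensitivity[i])
    policy: List[Dict] = [{} for _ in range(n)]
    for rank, layer in enumerate(order):
        bucket = min(rank * len(bit_choices) // n, len(bit_choices) - 1)
        policy[layer] = {"bits": bit_choices[bucket],
                         "sparsity": sparsity_choices[bucket]}
    return policy


def magnitude_prune(weights: Sequence[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def fake_quantize(weights: Sequence[float], bits: int) -> List[float]:
    """Symmetric uniform quantization, returned in dequantized (float) form."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def compress_layer(weights: Sequence[float], layer_policy: Dict) -> List[float]:
    """Apply one layer's policy: prune first, then quantize the survivors."""
    pruned = magnitude_prune(weights, layer_policy["sparsity"])
    return fake_quantize(pruned, layer_policy["bits"])


# Example: three layers with made-up sensitivity scores.
policy = assign_policy([0.9, 0.1, 0.5])
compressed = compress_layer([0.1, -0.8, 0.3, 0.05], policy[1])
```

Because every layer can end up with a different bit-width and sparsity, the resulting workload is non-uniform across layers, which is the kind of irregular computation pattern the abstract says the hardware scheduling component is meant to handle.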

Code & Models

Repositories

gatech-eic/edge-llm
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning