Efficient Deployment of Large Language Models on Resource-constrained Devices

Zhiwei Yao; Yang Xu; Hongli Xu; Yunming Liao; Zuan Xie

arXiv:2501.02438·cs.LG·January 7, 2025·3 cites

Open Access

TL;DR

FedSpine is a federated learning framework that combines parameter-efficient fine-tuning, structured pruning, and adaptive device-specific adjustments to enable efficient deployment of large language models on resource-constrained devices, maintaining accuracy and reducing latency.

Contribution

The paper introduces FedSpine, a novel federated learning framework that integrates parameter-efficient fine-tuning, structured pruning, and adaptive algorithms to optimize large language model deployment on heterogeneous, resource-limited devices.
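As a rough illustration of how parameter-efficient fine-tuning and structured pruning can be combined, the sketch below applies both to one toy linear layer in NumPy. It is not FedSpine's implementation. The LoRA-style low-rank update used as the PEFT method, the row-norm importance score, the toy regression data, and all names (`tune_step`, `prune_step`, `keep`) are assumptions made for illustration, and the federated aggregation across devices is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 32, 4

W = rng.normal(size=(d_out, d_in))                 # pretrained weight, kept frozen
A = rng.normal(size=(rank, d_in)) / np.sqrt(d_in)  # low-rank factor (trainable)
B = np.zeros((d_out, rank))                        # low-rank factor (trainable), zero init

# Toy "private" data: targets come from a slightly shifted weight matrix.
X = rng.normal(size=(256, d_in))
Y = X @ (W + 0.1 * rng.normal(size=W.shape)).T


def forward(X, W, A, B, keep):
    """Outputs of the kept neurons only, using the effective weight W + B @ A."""
    return X @ (W[keep] + B[keep] @ A).T


def tune_step(X, Y, W, A, B, keep, lr=0.05):
    """One gradient step on the low-rank factors A and B; W is never modified."""
    err = forward(X, W, A, B, keep) - Y[:, keep]   # shape (n, kept)
    G = err.T @ X / len(X)                         # gradient w.r.t. the effective weight
    grad_B = G @ A.T                               # both gradients are computed
    grad_A = B[keep].T @ G                         # before either factor is updated
    B[keep] -= lr * grad_B
    A -= lr * grad_A
    return float(np.mean(err ** 2))


def prune_step(W, A, B, keep, n_drop):
    """Structured pruning: drop the n_drop kept neurons with the smallest row norm."""
    scores = np.linalg.norm(W[keep] + B[keep] @ A, axis=1)
    survivors = np.argsort(scores)[n_drop:]        # indices into `keep`, weakest removed
    return np.sort(keep[survivors])


keep = np.arange(d_out)                            # every output neuron starts active
for r in range(4):
    for _ in range(50):                            # tune the adapter on the current structure
        loss = tune_step(X, Y, W, A, B, keep)
    keep = prune_step(W, A, B, keep, n_drop=4)     # then remove whole neurons
    print(f"round {r}: loss {loss:.4f}, neurons kept {len(keep)}")

# Merge the adapter into a smaller dense layer for deployment.
W_deploy = W[keep] + B[keep] @ A
print("deployed weight shape:", W_deploy.shape)
```

Removing whole rows, rather than individual weights, is what makes the pruning structured: the merged layer is a smaller dense matrix, so memory and compute shrink without sparse kernels.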

Findings

1. Speeds up fine-tuning by 1.4× to 6.9×
2. Improves final accuracy by 0.4% to 4.5%
3. Maintains higher inference accuracy with reduced resource usage

Abstract

Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is…
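The abstract is cut off before it says what the bandit selects or how it is rewarded, so the following is only a generic sketch of an online multi-armed bandit, using the standard UCB1 rule. The arms (candidate pruning ratios), the simulated Bernoulli reward, and every name in the code are hypothetical and are not taken from the paper.

```python
import math
import random


class UCB1:
    """Standard UCB1 bandit for rewards in [0, 1]."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms     # number of times each arm was played
        self.means = [0.0] * n_arms    # running mean reward of each arm

    def select(self):
        # Play every arm once before trusting the confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        ucb = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical setup: each arm is a candidate pruning ratio for one device.
ratios = [0.2, 0.4, 0.6, 0.8]
# Hypothetical success probabilities, hidden from the bandit; in a real system
# the reward would be measured online rather than simulated.
true_reward = [0.3, 0.5, 0.8, 0.4]

rng = random.Random(0)
bandit = UCB1(len(ratios))
for _ in range(2000):
    arm = bandit.select()
    reward = 1.0 if rng.random() < true_reward[arm] else 0.0
    bandit.update(arm, reward)

best = max(range(len(ratios)), key=bandit.counts.__getitem__)
print("most played ratio:", ratios[best], "play counts:", bandit.counts)
```

A bandit suits this setting because it needs no prior model of a device: it learns which option works from observed rewards alone, trading off exploration of rarely tried arms against exploitation of the best one so far.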

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Machine Learning and Algorithms

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning