Efficient Deployment of Large Language Models on Resource-constrained Devices
Zhiwei Yao, Yang Xu, Hongli Xu, Yunming Liao, Zuan Xie

TL;DR
FedSpine is a federated learning framework that combines parameter-efficient fine-tuning, structured pruning, and adaptive device-specific adjustments to enable efficient deployment of large language models on resource-constrained devices, maintaining accuracy and reducing latency.
Contribution
The paper introduces FedSpine, a novel federated learning framework that integrates parameter-efficient fine-tuning, structured pruning, and adaptive algorithms to optimize large language model deployment on heterogeneous, resource-limited devices.
Findings
Speeds up fine-tuning by 1.4× to 6.9×
Improves final accuracy by 0.4% to 4.5%
Maintains higher inference accuracy with reduced resource usage
Abstract
Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is…
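To make the "prune and tune" idea in the abstract concrete, the following is a minimal toy sketch, not FedSpine's actual algorithm. It assumes a single linear layer with a frozen weight W and a LoRA-style low-rank adapter B·A as the PEFT component, trains only B, and uses the L2 norm of each effective output row as the pruning criterion. All function names, the norm-based criterion, the keep schedule, and the toy data are illustrative assumptions and do not come from the paper.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(W, B, A, x):
    """y = W x + B (A x): frozen weight plus a low-rank adapter."""
    ax = matvec(A, x)
    return [wx + bax for wx, bax in zip(matvec(W, x), matvec(B, ax))]

def tune_step(W, B, A, x, y, lr=0.1):
    """One squared-error gradient step on B only; W and A stay frozen (assumption)."""
    ax = matvec(A, x)
    err = [p - t for p, t in zip(forward(W, B, A, x), y)]
    # dL/dB[i][k] = err[i] * (A x)[k] for L = 0.5 * ||forward - y||^2
    return [[b - lr * e * a for b, a in zip(b_row, ax)] for b_row, e in zip(B, err)]

def prune_rows(W, B, A, keep):
    """Structured pruning: keep the `keep` output rows with the largest
    L2 norm of the effective weight W + B A (hypothetical criterion)."""
    def norm(i):
        row = [w + sum(B[i][k] * A[k][j] for k in range(len(A)))
               for j, w in enumerate(W[i])]
        return math.sqrt(sum(v * v for v in row))
    kept = sorted(sorted(range(len(W)), key=norm, reverse=True)[:keep])
    return [W[i] for i in kept], [B[i] for i in kept], kept

# Toy layer: 4 output rows, 2 inputs, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 0.1], [0.5, 0.5], [0.01, 0.02]]
A = [[1.0, 1.0]]
B = [[0.0], [0.0], [0.0], [0.0]]
x = [1.0, 2.0]
y = [1.0, 0.0, 2.0, 0.0]

# Alternate tuning and pruning, shrinking the layer from 4 rows to 3, then to 2.
for keep in (3, 2):
    for _ in range(20):
        B = tune_step(W, B, A, x, y)
    W, B, kept = prune_rows(W, B, A, keep)
    y = [y[i] for i in kept]  # drop targets of removed rows

print(len(W), forward(W, B, A, x))
```

Because whole output rows are removed from both W and B, the layer genuinely shrinks, which is what distinguishes structured pruning from masking individual weights. In a federated setting each device would run such a loop on its own private data, with the pruning and adapter settings adjusted per device; the abstract attributes that adjustment to an online MAB algorithm, which this sketch does not attempt to model.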
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Machine Learning and Algorithms
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning
