Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

Ruiyang Qin; Dancheng Liu; Chenhui Xu; Zheyu Yan; Zhaoxuan Tan; Zhenge Jia; Amir Nassereldine; Jiajie Li; Meng Jiang; Ahmed Abbasi; Jinjun Xiong; Yiyu Shi

arXiv:2406.03777·cs.LG·October 3, 2024·3 cites

TL;DR

This paper empirically investigates how to effectively deploy large language models on resource-limited edge devices, providing practical guidelines based on extensive experiments for optimizing customization and inference.

Contribution

It offers novel empirical guidelines for deploying LLMs on resource-constrained devices, considering various design tradeoffs and their impacts on learning efficiency and accuracy.

Findings

01. The optimal choice between parameter learning and RAG depends on task difficulty
02. Longer fine-tuning time may not improve performance
03. Compressed LLMs can outperform uncompressed ones when data is limited

Abstract

The scaling laws have become the de facto guidelines for designing large language models (LLMs), but they were studied under the assumption of unlimited computing resources for both training and inference. As LLMs are increasingly used as personalized intelligent assistants, their customization (i.e., learning through fine-tuning) and deployment onto resource-constrained edge devices will become increasingly prevalent. An urgent but open question is how a resource-constrained computing environment affects the design choices for a personalized LLM. We study this problem empirically in this work. In particular, we consider the tradeoffs among a number of key design factors and their intertwined impacts on learning efficiency and accuracy. The factors include the learning methods for LLM customization, the amount of personalized data used for learning customization, the types and…

Taxonomy

Topics: Advanced Data Storage Technologies · Data Stream Mining Techniques · Semantic Web and Ontologies

Methods: Attention Is All You Need · WordPiece · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Softmax · Layer Normalization · BERT · Linear Layer · Byte Pair Encoding