Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization

Xinyuan Zhang; Jiang Liu; Zehui Xiong; Yudong Huang; Gaochang Xie; Ran Zhang

arXiv:2405.07140 · cs.LG · May 14, 2024 · 1 citation

TL;DR

This paper proposes an optimization framework for deploying large language models on edge devices using batching and quantization, improving inference throughput while considering resource constraints and user requirements.
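To make the interplay between quantization, batching, and edge memory limits concrete, here is a minimal back-of-the-envelope sketch. It is not the paper's formulation or its DFTSP algorithm. It only assumes a decoder-only transformer with standard multi-head attention, a 16-bit key/value cache, and a fixed memory budget; the function names and the example model dimensions (7 billion parameters, 32 layers, hidden size 4096, 2048-token sequences, 16 GiB of memory) are hypothetical and chosen purely for illustration.

```python
def weight_bytes(n_params: int, bits: int) -> int:
    """Memory needed to store the model weights at `bits` bits per parameter."""
    return n_params * bits // 8


def kv_cache_bytes_per_token(n_layers: int, hidden_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """One key and one value vector per layer for a single token
    (assumes standard multi-head attention and a 16-bit cache by default)."""
    return 2 * n_layers * hidden_dim * bytes_per_elem


def max_batch_size(mem_budget: int, n_params: int, bits: int,
                   n_layers: int, hidden_dim: int, seq_len: int) -> int:
    """Largest number of concurrent requests whose KV caches fit in the
    memory left over after loading the quantized weights."""
    free = mem_budget - weight_bytes(n_params, bits)
    if free <= 0:
        return 0  # the weights alone do not fit
    per_request = kv_cache_bytes_per_token(n_layers, hidden_dim) * seq_len
    return free // per_request


# Hypothetical edge device and model, for illustration only.
BUDGET = 16 * 2**30          # 16 GiB of accelerator memory
N_PARAMS = 7_000_000_000     # 7B-parameter model
N_LAYERS, HIDDEN, SEQ_LEN = 32, 4096, 2048

for bits in (16, 8, 4):
    b = max_batch_size(BUDGET, N_PARAMS, bits, N_LAYERS, HIDDEN, SEQ_LEN)
    print(f"{bits}-bit weights -> batch size up to {b}")
```

Under these assumed numbers, lower-precision weights free memory for a larger batch, which is the basic reason the two techniques are considered together. The sketch ignores activation memory, accuracy loss from quantization, and latency requirements, all of which a full optimization problem would have to account for.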

Contribution

It introduces a novel edge inference optimization problem for LLMs, with a new algorithm (DFTSP) that enhances throughput and reduces complexity compared to existing methods.

Findings

01 DFTSP outperforms batching benchmarks in throughput.

02 DFTSP reduces time complexity by over 45%.

03 Simulation shows improved resource utilization and latency management.

Abstract

Generative Artificial Intelligence (GAI) is taking the world by storm with its unparalleled content creation ability. Large Language Models (LLMs) are at the forefront of this movement. However, the significant resource demands of LLMs often require cloud hosting, which raises issues regarding privacy, latency, and usage limitations. Although edge intelligence has long been utilized to solve these challenges by enabling real-time AI computation on ubiquitous edge resources close to data sources, most research has focused on traditional AI models and has left a gap in addressing the unique characteristics of LLM inference, such as considerable model size, auto-regressive processes, and self-attention mechanisms. In this paper, we present an edge intelligence optimization problem tailored for LLM inference. Specifically, with the deployment of the batching technique and model quantization…

Taxonomy

Topics: Topic Modeling