EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
Mingjin Zhang, Jiannong Cao, Xiaoming Shen, Zeyang Cui

TL;DR
EdgeShard is a collaborative edge computing framework that partitions a large language model into shards and deploys them across edge devices and cloud servers. This reduces inference latency and increases throughput while addressing the privacy and bandwidth costs of cloud-only inference.
Contribution
The paper proposes a framework for efficient LLM inference on edge-cloud systems that selects the participating devices and partitions the model across them, using an adaptive optimization algorithm.
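To make the idea of joint partitioning and device selection concrete, here is a minimal sketch in Python. It is not the paper's algorithm, which this summary does not spell out. It assumes a simplified setting: the model is a stack of identical layers, devices form a fixed chain, each device has a per-layer compute time and a layer capacity, and every hop between consecutive shards costs the same transfer time. The names `Device`, `partition_layers`, `sec_per_layer`, `max_layers`, and `link_sec` are all hypothetical.

```python
from dataclasses import dataclass
from math import inf


@dataclass
class Device:
    name: str
    sec_per_layer: float  # assumed: time to run one layer on this device
    max_layers: int       # assumed: number of layers that fit in memory


def partition_layers(num_layers, devices, link_sec):
    """Assign contiguous layer blocks to devices, in the given device order.

    Minimizes sequential latency (compute time plus one `link_sec` per hop
    between shards). A device that receives 0 layers is simply not selected.
    Returns (total_latency, layers_per_device).
    """
    n = len(devices)
    # best[d][l]: minimum latency to place the first l layers on the first d devices
    best = [[inf] * (num_layers + 1) for _ in range(n + 1)]
    choice = [[0] * (num_layers + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    for d in range(1, n + 1):
        dev = devices[d - 1]
        for l in range(num_layers + 1):
            for k in range(min(dev.max_layers, l) + 1):
                prev = best[d - 1][l - k]
                if prev == inf:
                    continue
                # Pay a hop only when this shard follows an earlier shard.
                hop = link_sec if (k > 0 and l - k > 0) else 0.0
                cost = prev + k * dev.sec_per_layer + hop
                if cost < best[d][l]:
                    best[d][l] = cost
                    choice[d][l] = k

    if best[n][num_layers] == inf:
        raise ValueError("model does not fit on the given devices")

    # Walk the recorded choices backwards to recover the allocation.
    alloc, l = [], num_layers
    for d in range(n, 0, -1):
        k = choice[d][l]
        alloc.append(k)
        l -= k
    alloc.reverse()
    return best[n][num_layers], alloc


if __name__ == "__main__":
    devices = [
        Device("edge-a", 0.04, 10),
        Device("edge-b", 0.03, 12),
        Device("cloud", 0.01, 20),
    ]
    latency, alloc = partition_layers(32, devices, link_sec=0.05)
    for dev, k in zip(devices, alloc):
        print(f"{dev.name}: {k} layers")
    print(f"estimated latency per token: {latency:.2f} s")
```

With these made-up numbers no single device can hold all 32 layers, so the sketch fills the fast cloud device with 20 layers, places the remaining 12 on `edge-b`, and leaves `edge-a` unselected. A real system would also need measured per-layer profiles, heterogeneous link bandwidths, and a throughput-oriented objective for pipelined execution.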
Findings
Achieves up to 50% latency reduction
Doubles throughput compared to baseline methods
Demonstrates effectiveness on Llama2 models
Abstract
Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs rely heavily on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns. Edge computing is a promising way to address these concerns by deploying LLMs on edge devices, closer to the data sources. Some works use model quantization to shrink the model so that it fits on resource-constrained edge devices, but this leads to accuracy loss. Other works use cloud-edge collaboration, which suffers from unstable network connections. In this work, we leverage collaborative edge computing to let edge devices and cloud servers jointly perform efficient LLM inference. We propose a general framework that partitions the LLM into shards and deploys them on distributed devices. To achieve efficient LLM inference,…
Taxonomy
Topics: Blockchain Technology Applications and Security · Digital Rights Management and Security · Cloud Data Security Solutions
