CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration
Hongpeng Jin, Yanzhao Wu

TL;DR
CE-CoLLM introduces a cloud-edge collaborative framework for large language models that reduces latency and offloads computation to the edge, enabling efficient and adaptive inference in diverse environments.
Contribution
The paper presents novel techniques like latency-aware early exit and adaptive inference modes to optimize LLM deployment at the edge, addressing communication bottlenecks and reliability issues.
Findings
Reduces inference time by up to 13.81%.
Offloads over 84.53% of computation to the edge.
Maintains prediction accuracy with reduced communication overhead.
Abstract
Large Language Models (LLMs) exhibit remarkable human-like predictive capabilities. However, it is challenging to deploy LLMs to provide efficient and adaptive inference services at the edge. This paper proposes a novel Cloud-Edge Collaboration framework for LLMs (CE-CoLLM) to tackle these challenges. First, we identify the transmission of LLM contextual data between the cloud and edge as a key performance bottleneck, which introduces substantial communication overhead that dominates overall inference latency and makes naïve cloud-edge collaboration for LLMs inefficient. Second, we introduce a suite of novel techniques, including a latency-aware early exit mechanism and efficient cloud context management, into CE-CoLLM, which collectively reduce communication overhead and preserve LLM inference accuracy. Third, we design two adaptive inference modes to accommodate diverse edge…
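The abstract names an early exit mechanism at the edge but does not spell out its rule here. As a rough illustration only, the sketch below uses a plain confidence threshold on the edge model's next-token distribution: if the top probability is high enough, the edge keeps its own token; otherwise it defers to the cloud. The function names (`edge_early_exit`, `next_token`), the `cloud_fallback` callable, and the `0.8` threshold are all hypothetical, and the paper's actual latency-aware criterion may differ.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def edge_early_exit(logits: Sequence[float], threshold: float) -> Tuple[bool, int, float]:
    """Hypothetical exit rule: accept the edge prediction when the
    top-token probability reaches `threshold`."""
    probs = softmax(logits)
    token = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[token]
    return confidence >= threshold, token, confidence


def next_token(
    edge_logits: Sequence[float],
    cloud_fallback: Callable[[], int],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Tuple[int, str]:
    """Return (token, source). The cloud is contacted only when the
    edge prediction is not confident enough, which is where the
    communication savings would come from."""
    accept, token, _ = edge_early_exit(edge_logits, threshold)
    if accept:
        return token, "edge"
    return cloud_fallback(), "cloud"
```

Here `next_token([5.0, 0.1, 0.1], cloud_fallback=lambda: 7)` stays on the edge because one logit clearly dominates, while a flat distribution such as `[1.0, 1.0, 1.0]` falls through to the cloud callable.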
Taxonomy
Topics: Topic Modeling
