CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference

Suyi Li; Hanfeng Lu; Tianyuan Wu; Minchen Yu; Qizhen Weng; Xusheng Chen; Yizhou Shan; Binhang Yuan; Wei Wang

arXiv:2401.11240·cs.DC·January 23, 2024·1 cites

TL;DR

CaraServe is a system that improves LoRA adapter serving for large language models through CPU-assisted adapter loading and rank-aware scheduling, reducing request latency and raising service level objective (SLO) attainment.

Contribution

It introduces a novel CPU-assisted loading mechanism and a rank-aware scheduling algorithm for efficient LoRA adapter serving in LLM inference.

Findings

01  Cuts request serving latency by up to a factor of 1.4

02  Achieves up to 99% service level objective (SLO) attainment

03  Outperforms state-of-the-art LoRA serving systems

Abstract

Pre-trained large language models (LLMs) often need specialization for domain-specific tasks. Low-Rank Adaptation (LoRA) is a popular approach that adapts a base model to multiple tasks by adding lightweight trainable adapters. In this paper, we present CaraServe, a system that efficiently serves many LoRA adapters derived from a common base model. CaraServe maintains the base model on GPUs and dynamically loads activated LoRA adapters from main memory. As GPU loading results in a cold-start that substantially delays token generation, CaraServe employs a CPU-assisted approach. It early starts the activated adapters on CPUs for prefilling as they are being loaded onto GPUs; after loading completes, it then switches to the GPUs for generative LoRA inference. CaraServe develops a highly optimized synchronization mechanism to efficiently coordinate LoRA computation on the CPU and GPU.…
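
The CPU-assisted cold start described in the abstract can be illustrated with a toy sketch. This is not CaraServe's implementation: the names `AdapterSlot`, `lora_delta`, and `serve` are hypothetical, the "GPU" is only a flag, and the host-to-device copy is simulated by a thread that is released right after prefill so the run is deterministic. The LoRA term itself is the standard low-rank product x·A·B added to the base model's output.

```python
import threading


def lora_delta(x, A, B):
    """Return x @ A @ B for a vector x, with A (d x r) and B (r x k) as nested lists."""
    rank = len(B)
    hidden = [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(rank)]
    return [sum(hidden[j] * B[j][c] for j in range(rank)) for c in range(len(B[0]))]


class AdapterSlot:
    """Hypothetical holder for one LoRA adapter that starts out in main memory."""

    def __init__(self, A, B):
        self.A, self.B = A, B
        self.on_gpu = threading.Event()  # set once the simulated copy is done

    def load_to_gpu(self, copy_finished):
        # Stand-in for the host-to-device transfer: block until the
        # simulated copy finishes, then mark the adapter as GPU-resident.
        copy_finished.wait()
        self.on_gpu.set()

    def apply(self, x):
        # Pick the device that currently holds a usable copy of the adapter.
        device = "gpu" if self.on_gpu.is_set() else "cpu"
        return device, lora_delta(x, self.A, self.B)


def serve(slot, prompt_vec, decode_vecs):
    copy_finished = threading.Event()
    loader = threading.Thread(target=slot.load_to_gpu, args=(copy_finished,))
    loader.start()                      # loading begins in the background

    trace = [slot.apply(prompt_vec)]    # prefill runs on the CPU meanwhile

    copy_finished.set()                 # assumption: copy ends as prefill ends
    loader.join()

    trace += [slot.apply(v) for v in decode_vecs]  # decoding uses the GPU copy
    return trace


slot = AdapterSlot(A=[[1.0], [1.0]], B=[[2.0, 0.0]])   # d=2, rank=1, k=2
trace = serve(slot, [1.0, 2.0], [[0.5, 0.5], [1.0, 0.0]])
devices = [device for device, _ in trace]
print(devices)  # ['cpu', 'gpu', 'gpu']
```

In a real system the copy and the prefill overlap freely, and the hand-off between CPU and GPU needs the kind of synchronization mechanism the abstract mentions; the explicit `copy_finished` gate here only keeps the example reproducible.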

Code & Models

Repositories

modeltc/lightllm (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning