BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models

Bodun Hu; Jiamin Li; Le Xu; Myungjin Lee; Akshay Jajoo; Geon-Woo Kim; Hong Xu; Aditya Akella

arXiv:2404.18322·cs.DC·September 25, 2024

TL;DR

BlockLLM is a novel multi-tenant serving system for LLMs that uses component sharing and fine-grained model partitioning to enhance efficiency, reduce resource usage, and improve latency and GPU utilization.

Contribution

It introduces a flexible, component-based LLM serving system with dynamic assembly, cache coordination, and locality-aware placement, advancing multi-tenant LLM deployment.

Findings

01

Reduces memory and storage footprints.

02

Improves 95th-percentile latency.

03

Increases GPU utilization by 20.1%.

Abstract

The increasing demand for Large Language Models (LLMs) across various applications has led to a significant shift in the design of deep learning serving systems. Deploying LLMs, particularly in multi-tenant environments, poses substantial challenges due to their high computational and memory demands. We introduce BlockLLM, a serving system that leverages component sharing among fine-tuned LLM models to provide an efficient and flexible solution for LLM workloads. BlockLLM partitions models into finer-grained blocks, enabling the reuse of model components and independent provisioning to improve computation efficiency. BlockLLM comprises an offline block zoo for storing blocks and an online system to serve requests through chains of blocks. It offers multi-fold flexibilities: (1) Adaptive assembly of blocks on-the-fly through equivalence evaluation among blocks in the zoo; (2) Per-block…
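The block-sharing idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `BlockZoo`, `register_block`, `register_model`, and `serve` are hypothetical and do not come from the paper, blocks are stand-in functions rather than real model layers, and exact byte equality of weights stands in for the paper's equivalence evaluation, which is not described in the text above.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types and names; none of them are taken from BlockLLM itself.
Block = Callable[[List[float]], List[float]]


@dataclass
class BlockZoo:
    """Toy block store: each distinct block is kept once and shared by ID."""
    blocks: Dict[str, Block] = field(default_factory=dict)
    chains: Dict[str, List[str]] = field(default_factory=dict)

    def register_block(self, weights: bytes, fn: Block) -> str:
        # Assumption: identical weight bytes mean the block can be reused.
        block_id = hashlib.sha256(weights).hexdigest()[:12]
        self.blocks.setdefault(block_id, fn)  # keep the first copy only
        return block_id

    def register_model(self, name: str, parts: List[Tuple[bytes, Block]]) -> None:
        # A model is recorded as an ordered chain of block IDs.
        self.chains[name] = [self.register_block(w, fn) for w, fn in parts]

    def serve(self, name: str, x: List[float]) -> List[float]:
        # Serve a request by running it through the model's chain of blocks.
        for block_id in self.chains[name]:
            x = self.blocks[block_id](x)
        return x


def scale(k: float) -> Block:
    """Stand-in for a model component: multiplies every value by k."""
    return lambda xs: [k * v for v in xs]


zoo = BlockZoo()
shared_base = (b"base-layer-weights", scale(2.0))
zoo.register_model("tenant_a", [shared_base, (b"adapter-a", scale(3.0))])
zoo.register_model("tenant_b", [shared_base, (b"adapter-b", scale(0.5))])

# Two models with two blocks each, but only three blocks are stored.
print(len(zoo.blocks))
print(zoo.serve("tenant_a", [1.0, 2.0]))
```

In this toy setup the two tenants' chains begin with the same block ID, so the shared base component is stored once while each tenant keeps its own second block.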

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis