BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
Bodun Hu, Jiamin Li, Le Xu, Myungjin Lee, Akshay Jajoo, Geon-Woo Kim, Hong Xu, Aditya Akella

TL;DR
BlockLLM is a novel multi-tenant serving system for LLMs that uses component sharing and fine-grained model partitioning to enhance efficiency, reduce resource usage, and improve latency and GPU utilization.
Contribution
It introduces a flexible, component-based LLM serving system with dynamic assembly, cache coordination, and locality-aware placement, advancing multi-tenant LLM deployment.
Findings
Reduces memory and storage footprints.
Improves 95th-percentile latency.
Increases GPU utilization by 20.1%.
Abstract
The increasing demand for Large Language Models (LLMs) across various applications has led to a significant shift in the design of deep learning serving systems. Deploying LLMs, particularly in multi-tenant environments, poses substantial challenges due to their high computational and memory demands. We introduce BlockLLM, a serving system that leverages component sharing among fine-tuned LLM models to provide an efficient and flexible solution for LLM workloads. BlockLLM partitions models into finer-grained blocks, enabling the reuse of model components and independent provisioning to improve computation efficiency. BlockLLM comprises an offline block zoo for storing blocks and an online system to serve requests through chains of blocks. It offers multi-fold flexibilities: (1) Adaptive assembly of blocks on-the-fly through equivalence evaluation among blocks in the zoo; (2) Per-block…
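The block-sharing idea in the abstract can be illustrated with a small sketch. This is not BlockLLM's implementation: the `BlockZoo` class, its method names, and the toy weight lists are all hypothetical, and exact content hashing stands in for the paper's equivalence evaluation, which the summary above does not specify. It only shows how two fine-tuned variants that share components could be stored as chains pointing at a single copy of each shared block.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


class BlockZoo:
    """Hypothetical block store that keeps one copy per distinct block."""

    def __init__(self) -> None:
        self._blocks: Dict[str, Tuple[float, ...]] = {}  # block id -> weights
        self._chains: Dict[str, List[str]] = {}          # model name -> block ids

    @staticmethod
    def _block_id(weights: Sequence[float]) -> str:
        # Assumption: exact content match stands in for equivalence evaluation.
        digest = hashlib.sha256(repr(tuple(weights)).encode("utf-8"))
        return digest.hexdigest()[:12]

    def register_model(self, name: str, blocks: Sequence[Sequence[float]]) -> List[str]:
        """Split a model into blocks and store only the blocks not seen before."""
        chain: List[str] = []
        for weights in blocks:
            block_id = self._block_id(weights)
            self._blocks.setdefault(block_id, tuple(weights))
            chain.append(block_id)
        self._chains[name] = chain
        return chain

    def chain(self, name: str) -> List[str]:
        """Return the ordered block ids a request for this model would traverse."""
        return list(self._chains[name])

    def stored_blocks(self) -> int:
        return len(self._blocks)

    def referenced_blocks(self) -> int:
        return sum(len(chain) for chain in self._chains.values())


# Toy data: the two models share their first two blocks and differ in the last.
zoo = BlockZoo()
zoo.register_model("base", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
zoo.register_model("tuned", [[0.1, 0.2], [0.3, 0.4], [0.9, 0.9]])
print(zoo.stored_blocks(), "blocks stored for", zoo.referenced_blocks(), "references")
```

In this toy setup, six block references resolve to four stored blocks, which is the kind of memory and storage saving the abstract attributes to component reuse.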
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
