Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
Tingyang Sun, Ting He, I-Hong Hou

TL;DR
This paper addresses the challenge of serving large foundation models with high memory demands by formulating a novel server chain composition problem, proposing scalable algorithms, and demonstrating significant response time improvements.
Contribution
It introduces the fundamental problem of server chain composition for large memory footprint jobs, proves its NP-hardness, and develops scalable algorithms with performance guarantees.
Findings
Significant reduction in response times using the proposed algorithms
NP-hardness of the server chain composition problem established
Effective load balancing improves large model serving efficiency
Abstract
As a current trend in Artificial Intelligence (AI), large foundation models are increasingly employed as the core of AI services. However, even after training, serving such models at scale remains a challenging task due to their heavy resource footprints, particularly in terms of GPU memory. While recent works revealed unique characteristics of systems serving foundation models that distinguish them from traditional distributed computing systems, there is still a lack of fundamental understanding of the underlying system management problems. This work aims at addressing this gap by extracting a novel problem of "server chain composition" via block placement and cache allocation for serving chain-structured jobs with large memory footprints, which models a fundamental problem in serving large foundation models through pipeline parallelism. After showing the NP-hardness of the optimal…
