Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving

Tingyang Sun; Ting He; I-Hong Hou

arXiv:2604.14993·cs.DC·April 17, 2026

TL;DR

This paper addresses the challenge of serving large foundation models with high memory demands by formulating a novel server chain composition problem, proposing scalable algorithms, and demonstrating significant response time improvements.

Contribution

The paper introduces the fundamental problem of server chain composition for jobs with large memory footprints, proves its NP-hardness, and develops scalable algorithms with performance guarantees.

Findings

01. Significant reduction in response times using the proposed algorithms

02. NP-hardness of the server chain composition problem established

03. Effective load balancing improves large model serving efficiency

Abstract

As a current trend in Artificial Intelligence (AI), large foundation models are increasingly employed as the core of AI services. However, even after training, serving such models at scale remains a challenging task due to their heavy resource footprints, particularly in terms of GPU memory. While recent works revealed unique characteristics of systems serving foundation models that distinguish them from traditional distributed computing systems, there is still a lack of fundamental understanding of the underlying system management problems. This work aims at addressing this gap by extracting a novel problem of "server chain composition" via block placement and cache allocation for serving chain-structured jobs with large memory footprints, which models a fundamental problem in serving large foundation models through pipeline parallelism. After showing the NP-hardness of the optimal…
