BLITZSCALE: Fast and Live Large Model Autoscaling with O(1) Host Caching

Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, and Haibo Chen

arXiv:2412.17246 · cs.DC · June 17, 2025

TL;DR

BLITZSCALE is an autoscaling approach for large-model serving that loads parameters over the compute network between GPUs, needing no or O(1) host caching, and makes scaling live by working at layer granularity instead of instance granularity. It reports up to 94% lower tail latency and 49% less GPU time.

Contribution

The paper presents an autoscaling method that combines network-based parameter loading, using network-optimized multicast, with fine-grained layer-level scaling. Together these give fast, live model scaling with no or O(1) host caching.

Findings

01 Up to 94% reduction in tail latency compared to state-of-the-art systems
02 49% GPU time reduction for serving models
03 Effective scaling across multiple hosts with O(1) caching

Abstract

Model autoscaling is the key mechanism to achieve serverless model-as-a-service, but it faces a fundamental trade-off between scaling speed and storage/memory usage to cache parameters, and cannot meet frequent scaling requirements across multiple hosts. The key problem is that data plane performance is slow, and scaled instances remain stopped while parameters are loading. In this paper, we first show that the data plane can be made fast with no or O(1) caching by loading parameters through the compute network between GPUs because: (1) its speed is comparable to host cache and is underutilized, and (2) scaling multiple instances requires no or O(1) caching with network-optimized multicast. Second, autoscaling can be made live by breaking the scaling abstraction for inference from a coarse-grained instance-level to a fine-grained layer-level. This allows us to offload the layer…
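The claim that scaling multiple instances needs only O(1) caching can be illustrated with a pipelined chain broadcast. The sketch below is a minimal illustration and not BLITZSCALE's actual multicast plan: it assumes one node (node 0) holds the only copy of the parameters, that every link moves one layer per time step, and that each new instance forwards a layer to the next instance one step after receiving it. The function names `chain_multicast_schedule` and `simulate` are hypothetical.

```python
from collections import defaultdict


def chain_multicast_schedule(num_layers, num_targets):
    """Build a pipelined chain schedule (illustrative assumption, not the paper's plan).

    Node 0 holds the single source copy; nodes 1..num_targets are new instances.
    Returns {step: [(sender, receiver, layer), ...]}.
    """
    schedule = defaultdict(list)
    for layer in range(num_layers):
        for hop in range(num_targets):
            # Layer `layer` crosses link hop -> hop+1 at step layer + hop.
            schedule[layer + hop].append((hop, hop + 1, layer))
    return dict(schedule)


def simulate(num_layers, num_targets):
    """Replay the schedule and return (number_of_steps, layers_held_per_node)."""
    held = {node: set() for node in range(num_targets + 1)}
    held[0] = set(range(num_layers))  # the only pre-existing copy
    schedule = chain_multicast_schedule(num_layers, num_targets)
    for step in sorted(schedule):
        transfers = schedule[step]
        # Transfers within a step run concurrently, so every sender must
        # already hold its layer before the step begins.
        for sender, _receiver, layer in transfers:
            if layer not in held[sender]:
                raise RuntimeError(f"node {sender} lacks layer {layer} at step {step}")
        for _sender, receiver, layer in transfers:
            held[receiver].add(layer)
    return len(schedule), held


if __name__ == "__main__":
    steps, held = simulate(num_layers=4, num_targets=3)
    print("steps:", steps)
    print("all targets complete:", all(len(held[n]) == 4 for n in (1, 2, 3)))
```

Under these assumptions, L layers reach N new instances in L + N - 1 steps from a single source copy, compared with L × N steps if the source sent the full model to each instance in turn. The number of pre-existing copies stays at one regardless of N.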

Taxonomy

Topics: Distributed and Parallel Computing Systems