Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
Wenbin Zhu (Shandong University), Zhaoyan Shen (Shandong University), Zili Shao (The Chinese University of Hong Kong), Hongjun Dai (Shandong University), and Feng Chen (Indiana University Bloomington)

TL;DR
Tangram is a system that significantly reduces cold-start latency in serverless LLM deployments by reusing GPU memory efficiently, enabling faster model loading and improved resource utilization.
Contribution
Tangram introduces a GPU memory reuse approach that combines tensor-level parameter sharing across models, on-demand KV cache allocation, and GPU-affinity-aware scheduling to accelerate serverless LLM loading.
Findings
Achieves up to 6.2x faster model loading
Reduces cold-start Time-To-First-Token by 23-55%
Demonstrates effective GPU memory reuse in a prototype implementation
Abstract
Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates Serverless LLM loading through efficient GPU memory reuse. By leveraging the unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource…
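The abstract names GPU-affinity-aware scheduling but does not describe the policy. As a rough illustration of the idea only, the sketch below places a request on the GPU that already holds the most bytes of the requested model's tensors, so the fewest bytes need to be transferred. The names `Gpu`, `bytes_to_load`, and `pick_gpu`, and the tensor-ID bookkeeping, are hypothetical and are not taken from the Tangram paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Gpu:
    """Hypothetical view of one GPU's retained parameters."""
    name: str
    # tensor id -> size in bytes, for tensors still resident in GPU memory
    resident: Dict[str, int] = field(default_factory=dict)


def bytes_to_load(gpu: Gpu, model_tensors: Dict[str, int]) -> int:
    """Bytes of the model that are NOT already resident on this GPU."""
    return sum(size for tid, size in model_tensors.items()
               if tid not in gpu.resident)


def pick_gpu(gpus: List[Gpu], model_tensors: Dict[str, int]) -> Gpu:
    """Choose the GPU with the least data left to transfer (assumed policy)."""
    return min(gpus, key=lambda g: bytes_to_load(g, model_tensors))


# Toy example: sizes are arbitrary units, not measurements.
model = {"embed": 4, "layer0": 8, "layer1": 8}
gpus = [
    Gpu("gpu0"),                                   # nothing cached
    Gpu("gpu1", {"embed": 4, "layer0": 8}),        # most of the model cached
]
chosen = pick_gpu(gpus, model)
print(chosen.name, bytes_to_load(chosen, model))
```

A real scheduler would also have to weigh free memory, current load, and KV cache demand; this sketch ignores all of those and ranks only by reusable bytes.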
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Big Data and Digital Economy
