Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity

Wenbin Zhu (Shandong University); Zhaoyan Shen (Shandong University); Zili Shao (The Chinese University of Hong Kong); Hongjun Dai (Shandong University); and Feng Chen (Indiana University Bloomington)

arXiv:2512.01357·cs.DC·December 2, 2025

TL;DR

Tangram reduces cold-start latency in serverless LLM deployments by retaining model parameters in otherwise unused GPU memory, which shortens model loading and improves resource utilization.

Contribution

Tangram introduces a GPU memory reuse approach for accelerating serverless LLM loading, built from three components: a unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation, and GPU-affinity-aware scheduling.

Findings

01 Achieves up to 6.2x faster model loading

02 Reduces cold-start Time-To-First-Token by 23-55%

03 Demonstrates effective GPU memory reuse in a prototype implementation

Abstract

Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates Serverless LLM loading through efficient GPU memory reuse. By leveraging the unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource…
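To make the memory-reuse and affinity ideas in the abstract concrete, here is a minimal sketch, not the paper's implementation. It assumes that tensors shared between models carry identical keys (for example, models fine-tuned from the same base), that sizes are plain integers, and that eviction is oldest-inserted-first. The names `GpuPool`, `bytes_to_transfer`, and `pick_gpu` are hypothetical, and KV cache allocation is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class GpuPool:
    """Hypothetical per-GPU pool that keeps tensors resident between model loads."""
    capacity: int                                  # bytes available for parameters
    resident: dict = field(default_factory=dict)   # tensor key -> size in bytes

    def used(self) -> int:
        return sum(self.resident.values())

    def bytes_to_transfer(self, model: dict) -> int:
        """Bytes of `model` that are not already resident on this GPU."""
        return sum(size for key, size in model.items() if key not in self.resident)

    def load(self, model: dict) -> int:
        """Make every tensor of `model` resident; return the bytes actually copied."""
        copied = self.bytes_to_transfer(model)
        # Evict tensors the incoming model does not need (oldest inserted first)
        # until the missing tensors fit. This policy is an assumption of the sketch.
        for key in list(self.resident):
            if self.used() + copied <= self.capacity:
                break
            if key not in model:
                del self.resident[key]
        if self.used() + copied > self.capacity:
            raise MemoryError("model does not fit in this pool")
        self.resident.update(model)
        return copied


def pick_gpu(pools: list, model: dict) -> int:
    """Affinity heuristic (assumed): pick the GPU needing the fewest bytes copied."""
    return min(range(len(pools)), key=lambda i: pools[i].bytes_to_transfer(model))


# Two toy models that share their base tensors but differ in one head.
model_a = {"base.embed": 4, "base.layer0": 8, "a.head": 2}
model_b = {"base.embed": 4, "base.layer0": 8, "b.head": 3}

pools = [GpuPool(capacity=16), GpuPool(capacity=16)]
first_copy = pools[0].load(model_a)        # cold load: every tensor is copied
chosen = pick_gpu(pools, model_b)          # GPU 0 already holds the shared tensors
second_copy = pools[chosen].load(model_b)  # only the differing head is copied
```

In this toy run the second load copies 3 bytes instead of 15, because the scheduler sends `model_b` to the GPU that already holds the shared tensors. Only `a.head` is evicted to make room.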

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Big Data and Digital Economy