Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start
Xueshen Liu, Yongji Wu, Yuncheng Yao, Danyang Zhuo, Ion Stoica, Z. Morley Mao

TL;DR
Foundry reduces cold-start latency in large language model serving by capturing CUDA graphs offline and reconstructing them online, so a new serving instance skips the slow capture step at startup.
Contribution
It introduces a template-based CUDA graph context materialization approach that persists execution context and enables rapid online reconstruction for LLM serving.
Findings
Reduces cold-start latency by up to 99%.
Cuts initialization time of Qwen3-235B from 10 minutes to 3.9 seconds.
Maintains throughput gains of CUDA graphs during fast startup.
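As a quick sanity check, the Qwen3-235B figure above can be converted into a relative reduction. This is only arithmetic on the two reported numbers, and it assumes "10 minutes" means roughly 600 seconds; the variable names are illustrative.

```python
# Reported figures from the findings above; 10 minutes is treated as ~600 s.
baseline_s = 10 * 60
foundry_s = 3.9

# Relative reduction in initialization time, as a percentage.
reduction_pct = (baseline_s - foundry_s) / baseline_s * 100
speedup = baseline_s / foundry_s

print(f"reduction: {reduction_pct:.2f}%  speedup: {speedup:.0f}x")
```

Under that assumption the reduction comes out just above 99%, which is consistent with the "up to 99%" headline.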
Abstract
Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to respond to rapidly changing workloads, but cold-start latency remains a major bottleneck. While recent systems have reduced model weight loading to seconds, CUDA graph capture still takes tens of seconds to minutes and often dominates startup. Unfortunately, CUDA graphs cannot be naively serialized: beyond graph topology, they are tightly coupled to execution context, including device addresses embedded in kernel arguments and kernel code lazily loaded during warmup. Existing approaches either rely on brittle kernel-specific patching or heavyweight process-level checkpoint/restore that are inflexible to dynamic parallelism switching. We present Foundry, a template-based CUDA graph context materialization system that persists both graph topology and execution context during an offline…
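The address-coupling problem described in the abstract can be illustrated with a toy model. The sketch below is not Foundry's implementation and calls no CUDA API. It only assumes that a captured graph can be viewed as a list of kernel nodes whose pointer arguments are raw device addresses, and that each address falls inside a known named buffer. All names (`PtrSlot`, `templatize`, `materialize`, the buffer names and addresses) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PtrSlot:
    """Symbolic stand-in for a device pointer: a buffer name plus a byte offset."""
    buffer: str
    offset: int


def templatize(nodes, buffers):
    """Replace raw addresses in kernel args with PtrSlot placeholders.

    nodes:   list of (kernel_name, [(kind, value), ...]), kind is "ptr" or "val"
    buffers: {name: (base_address, size_in_bytes)} as seen at capture time
    """
    template = []
    for kernel, args in nodes:
        new_args = []
        for kind, value in args:
            if kind == "ptr":
                # Find the buffer whose address range contains this pointer.
                name, base = next(
                    (n, b) for n, (b, size) in buffers.items()
                    if b <= value < b + size
                )
                new_args.append(PtrSlot(name, value - base))
            else:
                new_args.append(value)
        template.append((kernel, new_args))
    return template


def materialize(template, new_bases):
    """Rebuild concrete kernel args from a template and fresh base addresses."""
    return [
        (kernel, [
            new_bases[a.buffer] + a.offset if isinstance(a, PtrSlot) else a
            for a in args
        ])
        for kernel, args in template
    ]


# Addresses observed in the (hypothetical) offline capture process.
capture_buffers = {"weights": (0x7000_0000, 4096), "kv_cache": (0x7100_0000, 8192)}
captured = [("attention", [("ptr", 0x7000_0100), ("ptr", 0x7100_0200), ("val", 128)])]

template = templatize(captured, capture_buffers)

# A new process allocates the same buffers at different addresses.
online_bases = {"weights": 0x5000_0000, "kv_cache": 0x5200_0000}
graph = materialize(template, online_bases)
print(graph)
```

Saving the raw `captured` list and replaying it in the new process would point kernels at stale addresses. The template form stores only buffer-relative offsets, so it stays valid wherever the allocator places the buffers. The second coupling named in the abstract, kernel code lazily loaded during warmup, is not modeled here.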
