C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG
Shutian Luo, Ali Zafar Sadiq, Rui Yang, Mingye Zhang, Haiying Shen, Wei Wang, Yue Cheng

TL;DR
C2CServe is a novel serverless LLM serving system that leverages high-bandwidth CPU-GPU interconnects to stream model weights from host memory, reducing cold-start latency and improving resource utilization on MIG-enabled GPUs.
Contribution
The paper introduces C2CServe, which uses NVLink-C2C to enable request-level model switching without reloading weights, and proposes HybridGEMM and hierarchical scheduling for efficient GPU sharing.
Findings
Reduces cold-start latency by up to 7.1x for dense models.
Maintains over 95% TTFT (time to first token) and TPOT (time per output token) under C2C contention.
Enables efficient model streaming from host memory to MIG instances.
Abstract
Modern LLM serving is increasingly serverless in shape: large model catalogs, long-tail invocations, and multi-tenant demand. Existing GPU serving systems face a tradeoff: dedicated-GPU allocation wastes scarce HBM under sparse traffic, while GPU time sharing places model initialization and weight loading on the cold-start path. Spatial GPU sharing such as multi-instance GPU (MIG) provides isolation and accounting, but each slice has too little HBM for modern LLM weights. We observe that high-bandwidth CPU-GPU interconnects, such as NVLink-C2C (C2C) in NVIDIA GH200 and GB200 Superchips, change the memory constraint: model weights can reside in CPU memory and be streamed on demand to MIG instances, shifting model residency from scarce HBM to abundant host memory. Leveraging this capability, we present C2CServe, a request-granularity serverless LLM serving system that allows MIG…
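To make the memory-constraint argument concrete, the sketch below estimates a lower bound on the time needed to stream a model's weights once from host memory over a given link. It is a back-of-envelope illustration, not the paper's method or measurements. The function names, the 7B-parameter FP16 example model, and the bandwidth figures are assumptions: 450 GB/s per direction is the nominal peak commonly quoted for NVLink-C2C on GH200, and 64 GB/s per direction is the approximate nominal peak of PCIe Gen5 x16. Achieved bandwidth is lower in practice, and sharing the link across MIG instances would reduce it further.

```python
# Back-of-envelope lower bound on the time to stream model weights
# from host memory to the GPU. All figures below are nominal peak
# bandwidths (assumptions), not measurements from the paper.

def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Total weight size in bytes (default: 2 bytes/param, i.e. FP16/BF16)."""
    return num_params * bytes_per_param


def stream_seconds(total_bytes: float, bandwidth_gb_s: float) -> float:
    """Ideal transfer time in seconds at a sustained bandwidth given in GB/s."""
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return total_bytes / (bandwidth_gb_s * 1e9)


# Hypothetical link table: nominal one-direction peak bandwidth in GB/s.
LINKS_GB_S = {
    "NVLink-C2C (per direction)": 450.0,
    "PCIe Gen5 x16 (per direction)": 64.0,
}

if __name__ == "__main__":
    size = weight_bytes(7e9)  # illustrative 7B-parameter model in FP16
    for name, bandwidth in LINKS_GB_S.items():
        millis = stream_seconds(size, bandwidth) * 1e3
        print(f"{name}: {millis:.1f} ms per full pass over the weights")
```

The gap between the two links suggests why streaming weights on demand becomes plausible only with a C2C-class interconnect. The sketch ignores compute overlap, contention, and kernel behavior, all of which the full system has to handle.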
