TL;DR
SuperInfer is a novel LLM inference system for Superchips that employs SLO-aware rotary scheduling and memory management to significantly improve latency SLO compliance under high request loads.
Contribution
SuperInfer introduces RotaSched, a proactive rotary scheduler, and DuplexKV, an optimized NVLink-C2C transfer engine, to improve responsiveness and SLO adherence on Superchips.
Findings
Up to 74.7% improvement in TTFT SLO attainment rates.
Maintains comparable TBT and throughput to state-of-the-art systems.
Demonstrates the effectiveness of SLO-aware scheduling and memory co-design.
Abstract
Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer…
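To make the idea of SLO-aware rotation concrete, here is a minimal sketch of a least-slack-first selection under a fixed KV cache budget. It is a toy illustration of the general idea only, not the paper's RotaSched algorithm, which the abstract does not spell out. The names `Request`, `pick_resident`, `kv_blocks`, `deadline`, and `gpu_budget_blocks` are all hypothetical, and the greedy policy is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str          # hypothetical request identifier
    kv_blocks: int    # KV cache blocks the request needs on the GPU
    deadline: float   # time (s) by which its next token is due (TTFT or TBT SLO)


def pick_resident(requests, now, gpu_budget_blocks):
    """Greedy least-slack-first selection under a KV block budget.

    Assumption: requests closest to violating their SLO get GPU residency
    first; everything that does not fit is rotated out to CPU memory.
    """
    resident, offloaded = [], []
    used = 0
    # Slack is the time left before the request's next-token deadline.
    for req in sorted(requests, key=lambda r: r.deadline - now):
        if used + req.kv_blocks <= gpu_budget_blocks:
            resident.append(req.rid)
            used += req.kv_blocks
        else:
            offloaded.append(req.rid)
    return resident, offloaded


reqs = [
    Request("a", kv_blocks=4, deadline=1.0),
    Request("b", kv_blocks=3, deadline=0.2),
    Request("c", kv_blocks=2, deadline=0.5),
]
resident, offloaded = pick_resident(reqs, now=0.0, gpu_budget_blocks=6)
print(resident, offloaded)
```

With a budget of 6 blocks, the two requests with the least slack ("b" and "c") stay on the GPU and "a" is rotated out. A real scheduler would re-run such a decision every iteration and would also account for the cost of moving KV blocks over the interconnect, which this sketch ignores.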
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Data Storage Technologies
