SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

Jiahuan Yu; Mingtao Hu; Zichao Lin; Minjia Zhang

arXiv:2601.20309 · cs.DC · May 20, 2026

TL;DR

SuperInfer is a novel LLM inference system for Superchips that employs SLO-aware rotary scheduling and memory management to significantly improve latency SLO compliance under high request loads.

Contribution

It introduces RotaSched, a proactive rotary scheduler, and DuplexKV, an optimized NVLink-C2C transfer engine, to enhance responsiveness and SLO adherence on Superchips.

Findings

1. Up to 74.7% improvement in TTFT SLO attainment rates.

2. Maintains comparable TBT and throughput to state-of-the-art systems.

3. Demonstrates the effectiveness of SLO-aware scheduling and memory co-design.
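
As a reading aid for the first finding, the sketch below shows how a TTFT SLO attainment rate is commonly computed: the fraction of requests whose time to first token falls within the SLO. The function name, the sample latencies, and the 1.0 s threshold are illustrative assumptions, not values from the paper.

```python
def ttft_slo_attainment(ttfts_s, slo_s):
    """Return the fraction of requests whose TTFT is within the SLO."""
    if not ttfts_s:
        raise ValueError("need at least one request")
    met = sum(1 for t in ttfts_s if t <= slo_s)
    return met / len(ttfts_s)


# Hypothetical per-request TTFT measurements in seconds (not from the paper).
baseline = [0.4, 1.8, 2.5, 0.9, 3.1]
candidate = [0.3, 0.7, 1.2, 0.8, 1.6]
slo = 1.0  # assumed TTFT SLO in seconds

base_rate = ttft_slo_attainment(baseline, slo)
cand_rate = ttft_slo_attainment(candidate, slo)
print(f"baseline: {base_rate:.0%}, candidate: {cand_rate:.0%}")
```

The summary does not say whether the 74.7% figure is an absolute or a relative gain, so the sketch reports only the two rates and leaves the comparison to the reader.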

Abstract

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer…
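
The rotation idea described in the abstract can be illustrated with a toy scheduling step. This is a minimal sketch under stated assumptions, not RotaSched's actual policy: the `Request` fields, the `rotate` function, the `urgency_s` threshold, and the "evict the running request with the most TBT slack" rule are all hypothetical, and swapping out is modeled as moving a request to a list rather than transferring KV cache over NVLink-C2C.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    arrival_s: float
    ttft_slo_s: float           # first-token deadline, relative to arrival
    last_token_s: float = 0.0   # time the last token was produced
    tbt_slo_s: float = 0.1      # allowed gap between tokens

    def ttft_slack(self, now_s):
        return self.arrival_s + self.ttft_slo_s - now_s

    def tbt_slack(self, now_s):
        return self.last_token_s + self.tbt_slo_s - now_s


def rotate(running, waiting, gpu_slots, now_s, urgency_s=0.2):
    """One toy step: admit urgent waiting requests, evicting if needed.

    Returns (running, waiting, swapped_out) as new lists.
    """
    running, swapped = list(running), []
    # Waiting requests closest to their TTFT deadline come first.
    waiting = sorted(waiting, key=lambda r: r.ttft_slack(now_s))
    admitted = set()
    while waiting:
        cand = waiting[0]
        if len(running) >= gpu_slots:
            if cand.ttft_slack(now_s) > urgency_s:
                break  # not urgent enough to justify a rotation
            # Never evict a request admitted in this same step.
            evictable = [r for r in running if r.rid not in admitted]
            if not evictable:
                break
            # Assumed rule: evict the request that can best afford a pause.
            victim = max(evictable, key=lambda r: r.tbt_slack(now_s))
            running.remove(victim)
            swapped.append(victim)
        running.append(waiting.pop(0))
        admitted.add(cand.rid)
    return running, waiting, swapped


# Hypothetical state: two GPU slots, both busy, one urgent arrival.
running = [Request("a", 0.0, 1.0, last_token_s=0.95),
           Request("b", 0.0, 1.0, last_token_s=0.99)]
waiting = [Request("c", 0.2, 1.0)]
run, wait, out = rotate(running, waiting, gpu_slots=2, now_s=1.0)
print([r.rid for r in run], [r.rid for r in out])
```

In this example request `b` has just emitted a token, so it has the most time before its next token is due and is the one rotated out to make room for `c`. A waiting request that is still far from its deadline would leave the running set untouched.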

Code & Models

Repositories

Supercomputing-System-AI-Lab/SuperInfer (GitHub)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Data Storage Technologies