Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

Vikranth Srivatsa; Zijian He; Pu Guo; Dongming Li; Yiying Zhang

arXiv:2605.05467 · cs.DC · May 8, 2026

TL;DR

Nitsum is a distributed LLM serving system that dynamically adapts tensor parallelism and scheduling to optimize throughput and meet latency targets in multi-tenant environments.

Contribution

Nitsum treats tensor parallelism as a runtime control rather than a static deployment choice, adapting it dynamically as the workload changes.

Findings

1. Nitsum achieves up to 5.3x higher SLO-compliant goodput than state-of-the-art systems.

2. The system effectively manages workload variability and multi-tier contention.

3. Experimental results validate the benefits of adaptive tensor parallelism.

Abstract

LLM serving is increasingly multi-tenant: the same deployment must handle latency-critical interactive requests and more relaxed background workloads under a fixed GPU budget. This creates a tiered-SLO setting where maximizing overall goodput (requests that satisfy both TTFT and TPOT targets) is challenging because workload mix, request lengths, and load intensity vary over time. Existing systems mainly optimize request-level controls (e.g., queuing and batching) while keeping execution configuration largely static, which limits adaptation under multi-tier contention. We present Nitsum, a distributed LLM serving system that treats tensor parallelism (TP) as a first-class runtime control surface rather than a static deployment choice. Nitsum jointly optimizes TP level, prefill/decode GPU split, and request scheduling. To make frequent TP adaptation practical, Nitsum introduces TP-aware…
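
The goodput metric in the abstract counts only requests that meet both their TTFT (time to first token) and TPOT (time per output token) targets, with targets that differ by tier. The sketch below illustrates that definition only; it is not code from the paper. The tier names, the SLO values, and the names `SLO`, `Request`, `TIER_SLOS`, `meets_slo`, and `goodput` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    ttft_s: float  # maximum allowed time to first token, in seconds
    tpot_s: float  # maximum allowed time per output token, in seconds


@dataclass(frozen=True)
class Request:
    tier: str      # which SLO tier the request belongs to
    ttft_s: float  # measured time to first token, in seconds
    tpot_s: float  # measured time per output token, in seconds


# Hypothetical tier targets; the paper's actual values are not given here.
TIER_SLOS = {
    "interactive": SLO(ttft_s=0.5, tpot_s=0.05),
    "background": SLO(ttft_s=5.0, tpot_s=0.20),
}


def meets_slo(req: Request, slos: dict = TIER_SLOS) -> bool:
    """A request is 'good' only if it meets BOTH targets of its tier."""
    slo = slos[req.tier]
    return req.ttft_s <= slo.ttft_s and req.tpot_s <= slo.tpot_s


def goodput(requests: list, window_s: float, slos: dict = TIER_SLOS) -> float:
    """SLO-compliant requests completed per second over a time window."""
    good = sum(1 for r in requests if meets_slo(r, slos))
    return good / window_s


completed = [
    Request("interactive", ttft_s=0.3, tpot_s=0.04),  # meets both targets
    Request("interactive", ttft_s=0.3, tpot_s=0.08),  # misses TPOT
    Request("background", ttft_s=3.0, tpot_s=0.15),   # meets both targets
    Request("background", ttft_s=6.0, tpot_s=0.10),   # misses TTFT
]
print(goodput(completed, window_s=2.0))  # 2 good requests / 2 s = 1.0
```

Under this definition, a request that is fast on one metric but misses the other contributes nothing, which is why a single static configuration can score poorly when the tier mix shifts.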
