Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
Vikranth Srivatsa, Zijian He, Pu Guo, Dongming Li, Yiying Zhang

TL;DR
Nitsum is a distributed LLM serving system that dynamically adapts tensor parallelism and scheduling to optimize throughput and meet latency targets in multi-tenant environments.
Contribution
Nitsum treats tensor parallelism (TP) as a runtime control rather than a static deployment choice, jointly optimizing TP level, prefill/decode GPU split, and request scheduling.
Findings
Nitsum achieves up to 5.3x higher SLO-compliant goodput than state-of-the-art systems.
The system effectively manages workload variability and multi-tier contention.
Experimental results validate the benefits of adaptive tensor parallelism.
Abstract
LLM serving is increasingly multi-tenant: the same deployment must handle latency-critical interactive requests and more relaxed background workloads under a fixed GPU budget. This creates a tiered-SLO setting where maximizing overall goodput (requests that satisfy both TTFT and TPOT targets) is challenging because workload mix, request lengths, and load intensity vary over time. Existing systems mainly optimize request-level controls (e.g., queuing and batching) while keeping execution configuration largely static, which limits adaptation under multi-tier contention. We present Nitsum, a distributed LLM serving system that treats tensor parallelism (TP) as a first-class runtime control surface rather than a static deployment choice. Nitsum jointly optimizes TP level, prefill/decode GPU split, and request scheduling. To make frequent TP adaptation practical, Nitsum introduces TP-aware…
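The abstract defines goodput as the requests that satisfy both their TTFT (time to first token) and TPOT (time per output token) targets, with targets differing by tier. A minimal sketch of that metric follows. The tier names, thresholds, and the `SLO`, `Request`, and `goodput` names are hypothetical illustrations, not taken from the paper or from Nitsum's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Per-tier latency targets, in seconds (hypothetical values)."""
    ttft_s: float  # time-to-first-token target
    tpot_s: float  # time-per-output-token target


@dataclass(frozen=True)
class Request:
    """Measured latencies for one completed request."""
    tier: str
    ttft_s: float
    tpot_s: float


def meets_slo(req: Request, slos: dict[str, SLO]) -> bool:
    """A request counts only if it meets BOTH targets of its own tier."""
    slo = slos[req.tier]
    return req.ttft_s <= slo.ttft_s and req.tpot_s <= slo.tpot_s


def goodput(requests: list[Request], slos: dict[str, SLO], window_s: float) -> float:
    """SLO-compliant requests per second over an observation window."""
    compliant = sum(1 for r in requests if meets_slo(r, slos))
    return compliant / window_s


# Assumed tiers: a tight interactive tier and a relaxed background tier.
slos = {
    "interactive": SLO(ttft_s=0.5, tpot_s=0.05),
    "background": SLO(ttft_s=5.0, tpot_s=0.2),
}
requests = [
    Request("interactive", ttft_s=0.3, tpot_s=0.04),  # meets both targets
    Request("interactive", ttft_s=0.3, tpot_s=0.08),  # misses TPOT
    Request("background", ttft_s=4.0, tpot_s=0.15),   # meets both targets
    Request("background", ttft_s=6.0, tpot_s=0.10),   # misses TTFT
]
result = goodput(requests, slos, window_s=2.0)
```

Under these assumed numbers, two of the four requests comply, so a raw throughput of 2 requests/s corresponds to a goodput of 1 request/s. A request that is fast on one metric but misses the other contributes nothing, which is why tuning a single latency dimension is not enough in a tiered-SLO setting.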
