Regulating Branch Parallelism in LLM Serving
Swapnil Gandhi, Siva Hari, William J. Dally, Christos Kozyrakis

TL;DR
This paper introduces TAPER, a per-step admission controller for branch parallelism in LLM serving that admits extra branches only when their predicted latency cost fits within the batch's current slack, improving throughput without degrading co-batched requests.
Contribution
TAPER is a novel per-step admission controller that adaptively manages branch parallelism, improving throughput and latency in LLM serving systems.
Findings
TAPER improves goodput by 1.77x over IRP-Off and 1.48x over IRP-Eager.
TAPER maintains over 95% SLO attainment.
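Existing methods are brittle: eager execution inflates the shared decode step and degrades co-batched requests, while conservative fixed caps forgo the throughput that exposing branches was meant to deliver.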
Abstract
Recent methods expose intra-request parallelism in LLM outputs, allowing independent branches to decode concurrently. Existing serving systems execute these branches eagerly or under fixed caps. We show that both are brittle: eager admission inflates the shared decode step, degrading co-batched requests in serial stages, while conservative fixed caps forgo the throughput that motivated exposing branches in the first place. We call the excess step latency caused by admitted branches the branch externality and show that the safe width depends on batch composition, context lengths, and accumulated slack, all of which change continuously over a workload trace. We introduce TAPER, a per-step admission controller that treats extra branches as opportunistic work, admitted only when the predicted branch externality fits within the batch's current slack budget. Per-step regulation is practical…
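The admission rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Request` fields, the `predict_step_ms` latency predictor, the choice of the minimum per-request slack as the batch budget, and the round-robin order are all assumptions made for illustration. The sketch admits one extra branch at a time and stops as soon as the predicted branch externality (predicted step latency with the extra branches minus the predicted latency without them) would exceed the budget.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Request:
    """Hypothetical per-request state visible to the controller."""
    slack_ms: float        # latency this request can absorb in the current step
    pending_branches: int  # extra branches ready to decode for this request


def batch_slack_budget(requests: Sequence[Request]) -> float:
    # Assumption: the batch can absorb only as much delay as its
    # tightest request, and a request already behind has zero slack.
    if not requests:
        return 0.0
    return max(0.0, min(r.slack_ms for r in requests))


def admit_branches(
    requests: Sequence[Request],
    predict_step_ms: Callable[[int], float],
) -> List[int]:
    """Return how many extra branches to admit per request for one decode step.

    predict_step_ms(width) is an assumed latency model mapping the number of
    sequences decoded in the step to a predicted step latency, taken to be
    non-decreasing in width.
    """
    base_width = len(requests)            # one mandatory sequence per request
    base_ms = predict_step_ms(base_width)
    budget = batch_slack_budget(requests)

    admitted = [0] * len(requests)
    extra = 0
    progress = True
    while progress:
        progress = False
        # Round-robin so no single request monopolizes the spare capacity.
        for i, req in enumerate(requests):
            if admitted[i] >= req.pending_branches:
                continue
            externality = predict_step_ms(base_width + extra + 1) - base_ms
            if externality > budget:
                return admitted           # next branch would not fit the slack
            admitted[i] += 1
            extra += 1
            progress = True
    return admitted


def linear_model(width: int) -> float:
    # Toy latency model for the example: 10 ms fixed cost plus 0.5 ms per sequence.
    return 10.0 + 0.5 * width


batch = [Request(2.0, 3), Request(5.0, 4), Request(3.0, 0)]
plan = admit_branches(batch, linear_model)
```

With the toy linear model, the tightest request leaves 2.0 ms of slack, so four extra branches fit (0.5 ms each) and `plan` is `[2, 2, 0]`. Because the decision is recomputed every step, the admitted width follows changes in batch composition and slack rather than staying at a fixed cap.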
