Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads
Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou

TL;DR
This paper develops an analytical framework for optimally provisioning Attention-FFN disaggregated LLM serving systems under stochastic workloads, balancing memory and compute resources to minimize idle time and blocking.
Contribution
It introduces a novel workload-driven provisioning model with a closed-form optimal ratio and a Gaussian refinement, validated by trace-calibrated simulations.
Findings
The predicted optimal A/F ratio matches simulation within 10%.
The workload statistic θ governs provisioning across distributions.
The framework decomposes bottlenecks into Attention, communication, and FFN regimes.
Abstract
Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop an analytical provisioning framework for AFD bundles in an A--F topology under stochastic workloads. Two sources of randomness shape the problem: per-slot Attention workload evolves as KV caches grow and completed requests are replenished with random prompt and decode lengths, and synchronized execution across Attention workers introduces a barrier governed by the slowest worker. We address both via a renewal-reward…
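The barrier effect described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's model: it assumes, purely for illustration, that each Attention worker's per-step load is an independent Gaussian draw, and the function name `mean_barrier_time` and its parameters are hypothetical. It only shows that when every step waits for the slowest worker, the average step time rises above the per-worker mean as workers are added.

```python
import random
import statistics


def mean_barrier_time(num_workers, mean_load, std_load, steps=20000, seed=0):
    """Monte Carlo estimate of the average per-step time when each step
    waits for the slowest of `num_workers` synchronized workers.

    Assumption (not from the paper): per-worker loads are i.i.d. Gaussian,
    clipped at zero.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    step_times = []
    for _ in range(steps):
        # Independent per-step load for every worker.
        loads = [max(0.0, rng.gauss(mean_load, std_load))
                 for _ in range(num_workers)]
        # Barrier: the step finishes only when the slowest worker does.
        step_times.append(max(loads))
    return statistics.fmean(step_times)


if __name__ == "__main__":
    # Compare the average barrier time for a growing number of workers.
    for workers in (1, 2, 4, 8, 16):
        estimate = mean_barrier_time(workers, mean_load=1.0, std_load=0.1)
        print(workers, round(estimate, 3))
```

With a single worker the estimate stays close to the mean load; with more workers the expected maximum grows, which is the kind of straggler cost a provisioning rule has to account for.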
