Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads

Chendong Song; Meixuan Wang; Hang Zhou; Hong Liang; Yuan Lyu; Zixi Chen; Yuwei Fan; Zijie Zhou

arXiv:2601.21351·cs.LG·May 13, 2026

TL;DR

This paper develops an analytical framework for optimally provisioning Attention-FFN disaggregated LLM serving systems under stochastic workloads, balancing memory and compute resources to minimize idle time and blocking.

Contribution

It introduces a novel workload-driven provisioning model with a closed-form optimal ratio and a Gaussian refinement, validated by trace-calibrated simulations.

Findings

01

Predicted optimal A/F ratio matches simulation within 10%.

02

Workload statistic θ governs provisioning across distributions.

03

Framework decomposes bottlenecks into Attention, communication, and FFN regimes.

Abstract

Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop an analytical provisioning framework for AFD bundles in an $r$A--$1$F topology under stochastic workloads. Two sources of randomness shape the problem: per-slot Attention workload evolves as KV caches grow and completed requests are replenished with random prompt and decode lengths, and synchronized execution across Attention workers introduces a barrier governed by the slowest worker. We address both via a renewal-reward…
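The barrier effect mentioned in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's model: the function names, the exponential per-worker load distribution, and the trial count are all assumptions chosen only to show that a synchronized step across $r$ Attention workers costs as much as the slowest worker, so the mean step time grows with $r$ even when each worker's mean load is fixed.

```python
import random


def barrier_step_time(worker_loads):
    """A synchronized step ends only when the slowest worker finishes."""
    return max(worker_loads)


def mean_barrier_time(r, trials=2000, seed=0):
    """Estimate the mean step time for r workers behind one barrier.

    Assumption (hypothetical, not from the paper): each worker's per-step
    load is an independent Exponential(1) draw.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        loads = [rng.expovariate(1.0) for _ in range(r)]
        total += barrier_step_time(loads)
    return total / trials


if __name__ == "__main__":
    # Mean step time per value of r; each worker's mean load stays at 1.
    for r in (1, 2, 4, 8):
        print(r, round(mean_barrier_time(r), 3))
```

Under this assumed distribution the expected maximum of $r$ independent loads equals the harmonic number $1 + 1/2 + \dots + 1/r$, so the estimates should rise slowly with $r$. Any realistic provisioning study would replace the exponential draw with loads derived from KV-cache sizes and request lengths.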
