Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems
Guowei Liu, Hongming Li, Yaning Guo, Yongxi Lyu, Mo Zhou, Yi Liu, Zhaogeng Li, Yanpeng Wang

TL;DR
This paper systematically analyzes Attention-FFN Disaggregation (AFD) in large-scale MoE models, revealing its performance limits on standard clusters and identifying hardware-model conditions where AFD is advantageous.
Contribution
The paper extends the roofline model to the communication level, relating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU) to characterize AFD's performance boundaries and the conditions for effective deployment.
Findings
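AFD hits a dead zone on standard clusters: adding FFN instances does not improve Hardware FLOPS Utilization (HFU), because the computational workload is capped by scale-out bandwidth.
AFD's discrete node-level scaling incurs higher imbalance penalties than Expert Parallelism (EP).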
A specific hardware-model combination can mitigate AFD's limitations and enhance performance.
Abstract
Deploying large-scale MoE models presents challenges in memory capacity and bandwidth for expert activation. While Attention-FFN Disaggregation (AFD) has emerged as a potential architecture to decouple compute and memory resources, its performance boundaries compared to standard large-scale Expert Parallelism (EP) remain underexplored. In this paper, we conduct a systematic analysis of AFD by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). Our analysis reveals a dead zone on standard clusters: increasing FFN instance count fails to improve HFU as computational workload is capped by scale-out bandwidth, causing operator active time to shrink relative to the fixed latency budget. We further show that AFD's discrete node-level scaling incurs higher imbalance penalties than EP's…
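To make the idea of a communication-level roofline concrete, here is a minimal sketch. It is not the paper's formulation: it simply assumes that attainable throughput is the minimum of three ceilings (compute, memory, and network), and every function name, parameter, and number below is hypothetical and chosen for illustration only.

```python
def attainable_flops(peak_flops, mem_bw, net_bw,
                     flops_per_mem_byte, flops_per_net_byte):
    """Attainable FLOP/s under three ceilings (illustrative assumption).

    peak_flops          -- peak compute throughput, FLOP/s
    mem_bw              -- memory bandwidth, bytes/s
    net_bw              -- scale-out interconnect bandwidth, bytes/s
    flops_per_mem_byte  -- arithmetic intensity w.r.t. memory traffic
    flops_per_net_byte  -- arithmetic intensity w.r.t. network traffic
    """
    compute_roof = peak_flops
    memory_roof = mem_bw * flops_per_mem_byte
    network_roof = net_bw * flops_per_net_byte
    return min(compute_roof, memory_roof, network_roof)


def hfu(peak_flops, mem_bw, net_bw, flops_per_mem_byte, flops_per_net_byte):
    """Hardware FLOPS Utilization: attainable throughput over peak."""
    achieved = attainable_flops(peak_flops, mem_bw, net_bw,
                                flops_per_mem_byte, flops_per_net_byte)
    return achieved / peak_flops


# Hypothetical accelerator and workload, not taken from the paper.
example = hfu(peak_flops=1e15, mem_bw=3e12, net_bw=50e9,
              flops_per_mem_byte=500.0, flops_per_net_byte=4000.0)
print(f"HFU = {example:.2f}")
```

With these made-up numbers the network ceiling (2e14 FLOP/s) sits below both the memory ceiling (1.5e15) and the compute peak (1e15), so HFU is bounded by interconnect bandwidth regardless of how much compute is available. This is the kind of bandwidth-capped regime the abstract describes, under the simplifying assumption that the three ceilings do not overlap or interact.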
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Distributed and Parallel Computing Systems
