Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts
Sungmin Yun, Seonyong Park, Hwayong Nam, Younjoo Lee, Gunjun Lee, Kwanhee Kyung, Sangpyo Kim, Nam Sung Kim, Jongmin Kim, Hyungyo Kim, Juhwan Cho, Seungmin Baek, Jung Ho Ahn

TL;DR
This paper analyzes how two recent transformer architectural innovations, Multi-head Latent Attention (MLA) and Mixture of Experts (MoE), shift inference bottlenecks away from memory-bound attention and toward compute-bound operations, suggesting new hardware optimization strategies.
Contribution
The paper shows that MLA and MoE change the dominant bottlenecks in transformer inference, redirecting hardware focus from attention acceleration to interconnects and workload balancing.
Findings
MLA's arithmetic intensity exceeds that of MHA by more than two orders of magnitude.
Distributing MoE experts across accelerators balances computational load.
Hardware optimization should prioritize interconnects and workload distribution.
Abstract
Computational workloads composing traditional transformer models are starkly bifurcated. Multi-Head Attention (MHA) and Grouped-Query Attention are memory-bound due to low arithmetic intensity, while FeedForward Networks are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the attention bottleneck. This paper argues that recent architectural advances in transformer models -- Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) -- introduce new dominant bottlenecks, shifting the challenge away from memory-intensive attention. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude higher than that of MHA, moving it toward a compute-bound regime well-matched to modern accelerators such as GPUs. Second, distributing MoE experts across a pool of accelerators allows batching to tune their…
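The memory-bound versus compute-bound distinction drawn in the abstract can be illustrated with a rough roofline-style calculation. The sketch below is not the paper's model: it counts only the QK^T and AV multiply-adds for one decode step and the bytes needed to read the cached keys and values, ignoring softmax, query and output traffic, and on-chip caching. The function names, the `queries_per_kv` parameter (1 for MHA, larger when several query heads share one cached K/V as in Grouped-Query Attention), and the accelerator figures are all hypothetical choices made for illustration. It does not attempt to reproduce the MLA figure quoted above, since the paper's accounting is not given in this summary.

```python
def decode_attention_intensity(seq_len, head_dim, queries_per_kv=1, bytes_per_elem=2):
    """Approximate FLOPs per byte for one decode step over a cached K/V pair.

    Simplified model (an assumption, not the paper's accounting): only the
    QK^T and AV products are counted, and only K/V cache reads are charged.
    """
    # Each query performs 2*seq_len*head_dim FLOPs for QK^T and the same for AV.
    flops = queries_per_kv * 4 * seq_len * head_dim
    # K and V are each read once, no matter how many queries share them.
    bytes_moved = 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved


def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity at which a device moves from memory- to compute-bound."""
    return peak_flops / mem_bandwidth


def classify(intensity, peak_flops, mem_bandwidth):
    """Label a workload using the roofline ridge point."""
    if intensity >= ridge_point(peak_flops, mem_bandwidth):
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator: 1 PFLOP/s peak compute, 4 TB/s memory bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 4.0e12

mha = decode_attention_intensity(seq_len=4096, head_dim=128, queries_per_kv=1)
gqa = decode_attention_intensity(seq_len=4096, head_dim=128, queries_per_kv=8)

print("ridge point:", ridge_point(PEAK_FLOPS, MEM_BANDWIDTH))
print("MHA:", mha, classify(mha, PEAK_FLOPS, MEM_BANDWIDTH))
print("GQA (8 queries per K/V):", gqa, classify(gqa, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions the intensity depends only on how many queries reuse each cached element and on the element width, not on sequence length or head dimension, which is why a longer context alone does not lift MHA out of the memory-bound regime.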
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy
