Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts

Sungmin Yun; Seonyong Park; Hwayong Nam; Younjoo Lee; Gunjun Lee; Kwanhee Kyung; Sangpyo Kim; Nam Sung Kim; Jongmin Kim; Hyungyo Kim; Juhwan Cho; Seungmin Baek; Jung Ho Ahn

arXiv:2507.15465·cs.AR·January 30, 2026

TL;DR

This paper analyzes how two recent transformer architectural changes, Multi-head Latent Attention (MLA) and Mixture of Experts (MoE), shift inference bottlenecks away from memory-bound attention and toward compute-bound operations, and what that shift implies for hardware optimization.

Contribution

The paper shows that MLA and MoE change the dominant bottlenecks in transformer inference, redirecting hardware design effort from attention acceleration toward interconnects and workload balancing.

Findings

01. The arithmetic intensity of MLA exceeds that of MHA by over two orders of magnitude.

02. Distributing MoE experts across accelerators balances computational load.

03. Hardware optimization should prioritize interconnects and workload distribution.

Abstract

Computational workloads composing traditional transformer models are starkly bifurcated. Multi-Head Attention (MHA) and Grouped-Query Attention are memory-bound due to low arithmetic intensity, while FeedForward Networks are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the attention bottleneck. This paper argues that recent architectural advances in transformer models -- Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) -- introduce new dominant bottlenecks, shifting the challenge away from memory-intensive attention. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude higher than that of MHA, moving it toward a compute-bound regime well-matched to modern accelerators such as GPUs. Second, distributing MoE experts across a pool of accelerators allows batching to tune their…
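The memory-bound claim for MHA and Grouped-Query Attention can be illustrated with a back-of-the-envelope roofline calculation. The following is a minimal sketch, not the paper's model: it counts only the two matrix-vector products of decode-phase attention per cached token, ignores softmax and query/output traffic, and uses hypothetical accelerator numbers (`PEAK`, `BW`) chosen purely for illustration. The function names and parameters are assumptions introduced here. It does not attempt to model MLA, whose cache layout is not described in this excerpt.

```python
def decode_attention_intensity(head_dim, queries_per_kv, bytes_per_elem):
    """Approximate FLOPs per byte of KV cache read, per cached token.

    Simplifying assumption: only the Q.K dot product and the weighted
    accumulation of V are counted.
    """
    # Each query head spends 2*head_dim FLOPs on the K dot product and
    # 2*head_dim FLOPs on accumulating V.
    flops = 4 * head_dim * queries_per_kv
    # One K vector and one V vector are read once and shared by the group.
    bytes_read = 2 * head_dim * bytes_per_elem
    return flops / bytes_read


def is_memory_bound(intensity, peak_flops, mem_bandwidth):
    """Roofline test: below the ridge point, bandwidth limits throughput."""
    ridge_point = peak_flops / mem_bandwidth
    return intensity < ridge_point


# Hypothetical accelerator (illustrative values, not a real device):
# 1e15 FLOP/s peak compute and 3e12 B/s memory bandwidth.
PEAK, BW = 1e15, 3e12

# MHA: every query head has its own K/V head (queries_per_kv = 1).
mha = decode_attention_intensity(head_dim=128, queries_per_kv=1, bytes_per_elem=2)
# GQA: several query heads share one K/V head (here, 8 per group).
gqa = decode_attention_intensity(head_dim=128, queries_per_kv=8, bytes_per_elem=2)

print(f"MHA intensity: {mha:.1f} FLOP/byte, memory-bound: {is_memory_bound(mha, PEAK, BW)}")
print(f"GQA intensity: {gqa:.1f} FLOP/byte, memory-bound: {is_memory_bound(gqa, PEAK, BW)}")
```

Under these assumptions, both variants sit far below the ridge point of the hypothetical device, which is consistent with the abstract's description of them as memory-bound. Sharing each cached vector among more query heads raises the intensity proportionally in this simplified count.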

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy