Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation

Shaoyuan Chen; Wencong Xiao; Yutong Lin; Mingxing Zhang; Yingdi Shan; Jinlei Jiang; Kang Chen; Yongwei Wu

arXiv:2405.01814·cs.LG·April 11, 2025·1 cites

Open Access

TL;DR

This paper proposes a novel heterogeneous inference architecture for large language models that disaggregates attention computation onto memory-optimized devices, significantly improving throughput and efficiency.

Contribution

It introduces model-attention disaggregation, enabling efficient splitting of attention operators across heterogeneous devices, and demonstrates its effectiveness with the Lamina system.

Findings

01

Lamina achieves 16.1% to 90.1% higher throughput than existing solutions.

02

Disaggregating attention reduces memory bottlenecks and improves resource utilization.

03

Communication overhead between devices remains manageable with current networking technologies.

Abstract

Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of the decoding phase is still low. This is caused by the varying resource demands of different operators in transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long-context requests. To enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end…
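
The claim that decode-time attention is memory-intensive can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the paper: the function names and the simplified cost model (fp16 values, only KV-cache or weight reads counted, softmax and activation traffic ignored) are assumptions made for illustration.

```python
def attention_decode_intensity(context_len, hidden_dim, bytes_per_elem=2):
    """Approximate FLOPs per byte for one decode step of attention (one request).

    Hypothetical cost model: counts only the two matrix-vector products
    and the reads of the cached keys and values.
    """
    # q @ K^T and probs @ V each cost about 2 * context_len * hidden_dim FLOPs.
    flops = 4 * context_len * hidden_dim
    # The cached K and V tensors are each read once from device memory.
    bytes_read = 2 * context_len * hidden_dim * bytes_per_elem
    return flops / bytes_read


def linear_decode_intensity(batch_size, d_in, d_out, bytes_per_elem=2):
    """Approximate FLOPs per byte for a batched linear layer during decoding.

    Hypothetical cost model: weight reads dominate, activation traffic ignored.
    """
    # One multiply-add per weight per request in the batch.
    flops = 2 * batch_size * d_in * d_out
    # The weight matrix is read once and shared by every request in the batch.
    bytes_read = d_in * d_out * bytes_per_elem
    return flops / bytes_read


if __name__ == "__main__":
    print("attention:", attention_decode_intensity(context_len=4096, hidden_dim=4096))
    print("linear   :", linear_decode_intensity(batch_size=64, d_in=4096, d_out=4096))
```

Under these assumptions, attention stays near 1 FLOP per byte regardless of context length, because each request reads its own KV cache and batching does not amortize those reads. The linear layers' intensity grows with batch size since the weights are shared. This gap is the kind of mismatch that motivates placing attention on memory-optimized devices.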

Taxonomy

Topics: Topic Modeling