Large Language Model Partitioning for Low-Latency Inference at the Edge
Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

TL;DR
This paper introduces a dynamic, resource-aware partitioning algorithm for large language models that reduces inference latency at the edge by intelligently distributing attention heads across devices.
Contribution
It proposes a novel, myopic partitioning method that dynamically migrates attention heads during inference to optimize latency and resource utilization in edge environments.
Findings
Achieves latency within 15-20% of the optimal in small-scale setups.
Significantly improves inference speed over layer-based partitioning.
Reduces memory usage while maintaining low latency.
Abstract
Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence, the length grows and so does the memory and compute load, due to the expanding key-value caches, which store intermediate representations of all previously generated tokens in the multi-head attention (MHA) layer. As this iterative process steadily increases memory and compute demands, layer-based partitioning in resource-constrained edge environments often results in memory overload or high inference latency. To address this and reduce inference latency, we propose a resource-aware Transformer architecture partitioning algorithm, where the partitioning decision is updated at regular intervals during token generation. The approach is myopic in that it…
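To make the cache growth described in the abstract concrete, the following is a minimal back-of-the-envelope sketch, not the paper's cost model. It assumes a standard decoder-only Transformer with batch size 1 and 16-bit cache entries, where each attention head stores one key vector and one value vector per generated token. The function names and the model dimensions (32 layers, 32 heads, head dimension 128) are hypothetical and chosen only for illustration.

```python
def per_head_kv_bytes(seq_len, head_dim, bytes_per_elem=2):
    """Key-value cache bytes held by a single attention head in one layer.

    Assumption: one key and one value vector of size head_dim per token,
    stored at bytes_per_elem bytes per entry (2 bytes = 16-bit floats).
    """
    return 2 * seq_len * head_dim * bytes_per_elem


def kv_cache_bytes(seq_len, num_layers, num_heads, head_dim, bytes_per_elem=2):
    """Total key-value cache bytes across all layers and heads (batch size 1)."""
    per_head = per_head_kv_bytes(seq_len, head_dim, bytes_per_elem)
    return num_layers * num_heads * per_head


if __name__ == "__main__":
    # Hypothetical model dimensions, used only to show the linear growth.
    for seq_len in (128, 512, 2048):
        total = kv_cache_bytes(seq_len, num_layers=32, num_heads=32, head_dim=128)
        print(f"{seq_len:5d} tokens -> {total / 2**20:.0f} MiB")
    # Prints 64 MiB, 256 MiB, and 1024 MiB respectively.
```

Under these assumptions the cache grows linearly with the number of generated tokens, and the per-head figure is the amount of state that would have to move with a head if it were reassigned to another device.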
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Adam · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax
