From Attention to Disaggregation: Tracing the Evolution of LLM Inference
Madabattula Rajesh Kumar, Srinivasa Rao Aravilli, Mustafa Saify, Shashank Srivastava

TL;DR
This paper discusses the shift towards disaggregated inference architectures for large language models, which apply distributed systems principles to optimize latency, throughput, and cost and to overcome the limitations of traditional monolithic GPU clusters.
Contribution
It introduces a novel architectural approach that decouples the prefill and decode phases of inference into independently scalable components, improving performance and resource utilization for LLM deployment.
Findings
Decoupling inference phases reduces resource contention.
Disaggregated architecture improves latency and throughput.
Independent optimization of inference components enhances efficiency.
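The contention finding can be illustrated with a toy timing model. This is a minimal sketch under stated assumptions, not the paper's method: the function `inter_token_gaps`, the step times, and the prefill duration are all hypothetical, and the model ignores the cost of moving the KV cache between pools, which real disaggregated systems have to pay.

```python
def inter_token_gaps(decode_step_ms, n_steps, prefill_arrivals, colocated):
    """Return the time between consecutive output tokens for one running request.

    prefill_arrivals maps a decode step index to the duration (ms) of a new
    request's prefill that is scheduled just before that step.
    In the colocated case the prefill runs on the same worker and delays the
    decode step; in the disaggregated case it runs on a separate prefill pool.
    """
    gaps = []
    for step in range(n_steps):
        gap = decode_step_ms
        if colocated:
            # The shared worker must finish the incoming prefill first.
            gap += prefill_arrivals.get(step, 0.0)
        gaps.append(gap)
    return gaps


# Hypothetical numbers: 20 ms per decode step, one 300 ms prefill arriving at step 2.
arrivals = {2: 300.0}
colocated_gaps = inter_token_gaps(20.0, 5, arrivals, colocated=True)
disaggregated_gaps = inter_token_gaps(20.0, 5, arrivals, colocated=False)

print("colocated worst gap (ms):", max(colocated_gaps))
print("disaggregated worst gap (ms):", max(disaggregated_gaps))
```

In this toy setting the colocated request sees one long stall while the disaggregated request keeps a steady token cadence. Whether that advantage survives in practice depends on KV cache transfer cost and pool sizing, which the sketch deliberately leaves out.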
Abstract
The evolution of Large Language Models from the Transformer architecture to models with trillions of parameters has shifted the primary bottleneck from model training to real-time inference. Deploying these massive models is a complex distributed systems challenge constrained by memory bandwidth, computational throughput, and latency requirements. LLM inference fundamentally requires solving a multi-objective optimization problem: minimize latency, maximize throughput, and reduce cost. This paper explores the necessary architectural shift towards disaggregated inference, which applies distributed systems principles such as service decomposition, resource disaggregation, and workload partitioning to overcome the limitations of traditional monolithic GPU clusters. By decoupling the compute-intensive prefill phase from the memory-intensive decode phase into independently scalable…
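The contrast between a compute-intensive prefill phase and a memory-intensive decode phase can be made concrete with a roofline-style estimate of arithmetic intensity (FLOPs per byte moved) for a single dense layer. The sketch below is an approximation under stated assumptions rather than anything taken from the paper: the hidden size, the 16-bit element width, the batch size of one, and the `RIDGE` threshold are hypothetical values chosen for illustration.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one dense layer applied to `tokens` tokens."""
    # One multiply and one add per weight, per token.
    flops = 2 * tokens * d_in * d_out
    # Weights are read once; activations are read (d_in) and written (d_out) per token.
    weight_bytes = bytes_per_elem * d_in * d_out
    activation_bytes = bytes_per_elem * tokens * (d_in + d_out)
    return flops / (weight_bytes + activation_bytes)


def regime(intensity: float, ridge_flops_per_byte: float) -> str:
    """Classify against a ridge point (peak FLOP/s divided by memory bandwidth)."""
    return "compute-bound" if intensity >= ridge_flops_per_byte else "memory-bound"


HIDDEN = 4096   # hypothetical hidden size
RIDGE = 200.0   # hypothetical accelerator ridge point, in FLOPs per byte

# Prefill processes the whole prompt at once; decode processes one new token per step.
prefill = arithmetic_intensity(tokens=2048, d_in=HIDDEN, d_out=HIDDEN)
decode = arithmetic_intensity(tokens=1, d_in=HIDDEN, d_out=HIDDEN)

print(f"prefill: {prefill:.1f} FLOPs/byte, {regime(prefill, RIDGE)}")
print(f"decode:  {decode:.3f} FLOPs/byte, {regime(decode, RIDGE)}")
```

Under these assumptions the prefill pass reuses each weight across thousands of tokens, while a single-token decode step reads every weight to do only a couple of operations with it. That gap is the usual motivation for sizing and scheduling the two phases on separate resources.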
Taxonomy
Topics: Advanced Neural Network Applications · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
