Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications
Jiaxi Li, Yue Zhu, Eun Kyung Lee, Klara Nahrstedt

TL;DR
This paper systematically benchmarks disaggregated LLM serving, analyzing performance and energy implications of different KV cache transfer methods and optimizations, revealing that benefits depend on load and transfer medium.
Contribution
It provides a comprehensive evaluation of disaggregated LLM serving, including new baselines and analysis of energy-performance trade-offs across transfer mediums and optimization strategies.
Findings
Performance benefits depend on request load and transfer medium.
Disaggregation does not guarantee energy savings; it can consume more energy than colocated serving.
Stage-wise frequency scaling does not lead to energy savings in disaggregated setups.
Abstract
Unlike traditional Large Language Model (LLM) serving, which colocates the prefill and decode stages on the same GPU, disaggregated serving dedicates distinct GPUs to the prefill and decode workloads. Once the prefill GPU completes its task, the KV cache must be transferred to the decode GPU. While existing works have proposed various KV cache transfer paths across different memory and storage tiers, there remains a lack of systematic benchmarking that compares their performance and energy efficiency. Meanwhile, although optimization techniques such as KV cache reuse and frequency scaling have been utilized for disaggregated serving, their performance and energy implications have not been rigorously benchmarked. In this paper, we fill this research gap by re-evaluating prefill-decode disaggregation under different KV transfer mediums and optimization strategies. Specifically, we…
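To make the KV cache transfer step concrete, the following is a minimal back-of-the-envelope sketch, not the paper's benchmark. It uses the standard size estimate for a transformer KV cache (one key and one value tensor per layer) and divides by a link bandwidth. The model shape, prompt length, and bandwidth figures are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate the KV cache size for one request.

    The leading factor 2 accounts for one key and one value tensor per layer.
    bytes_per_elem=2 assumes 16-bit (FP16/BF16) cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Idealized transfer time: ignores latency, protocol overhead, and contention."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128,
# with a 4096-token prompt. These are assumptions, not values from the paper.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"KV cache size: {size / 2**20:.0f} MiB")

# Illustrative bandwidths in GB/s for a slower and a faster transfer medium.
for label, bandwidth in [("slower medium (assumed 5 GB/s)", 5.0),
                         ("faster medium (assumed 25 GB/s)", 25.0)]:
    print(f"{label}: {transfer_seconds(size, bandwidth) * 1e3:.1f} ms")
```

Under these assumptions the per-request transfer cost scales linearly with prompt length and inversely with link bandwidth, which is one reason the choice of transfer medium matters for disaggregated serving.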
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Cloud Computing and Resource Management
