Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications

Jiaxi Li; Yue Zhu; Eun Kyung Lee; Klara Nahrstedt

arXiv:2601.08833 · cs.PF · January 15, 2026

Open Access

TL;DR

This paper systematically benchmarks disaggregated LLM serving, analyzing performance and energy implications of different KV cache transfer methods and optimizations, revealing that benefits depend on load and transfer medium.

Contribution

It provides a comprehensive evaluation of disaggregated LLM serving, including new baselines and analysis of energy-performance trade-offs across transfer mediums and optimization strategies.

Findings

01

Performance benefits depend on request load and transfer medium.

02

Disaggregation does not guarantee energy savings, since it can incur higher energy consumption.

03

Stage-wise frequency scaling does not lead to energy savings in disaggregated setups.

Abstract

Unlike traditional Large Language Model (LLM) serving, which colocates the prefill and decode stages on the same GPU, disaggregated serving dedicates distinct GPUs to the prefill and decode workloads. Once the prefill GPU completes its task, the KV cache must be transferred to the decode GPU. While existing works have proposed various KV cache transfer paths across different memory and storage tiers, systematic benchmarking that compares their performance and energy efficiency is still lacking. Meanwhile, although optimization techniques such as KV cache reuse and frequency scaling have been applied to disaggregated serving, their performance and energy implications have not been rigorously benchmarked. In this paper, we fill this research gap by re-evaluating prefill-decode disaggregation under different KV transfer mediums and optimization strategies. Specifically, we…
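To make the KV cache handoff described in the abstract concrete, the following is a minimal back-of-the-envelope sketch of how large the transferred cache is and how long the transfer might take. It is not taken from the paper: the `ModelShape` class, the helper names, the model dimensions, and the bandwidth figures are all hypothetical placeholders chosen for illustration, and the estimate ignores latency, protocol overhead, and any overlap with compute.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical transformer dimensions (not from the paper)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # assumes fp16/bf16 KV entries


def kv_cache_bytes(shape: ModelShape, num_tokens: int) -> int:
    """Size of the KV cache for one request of num_tokens tokens.

    The leading factor 2 accounts for one key and one value
    tensor stored per layer.
    """
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_element)
    return per_token * num_tokens


def transfer_seconds(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time: bytes divided by sustained bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    # Illustrative shape: 32 layers, 8 KV heads, head dimension 128.
    shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
    size = kv_cache_bytes(shape, num_tokens=4096)
    print(f"KV cache: {size / 2**20:.0f} MiB")

    # Assumed, round-number bandwidths for three kinds of medium.
    for medium, gb_per_s in [("fast link", 300.0),
                             ("slower link", 25.0),
                             ("storage tier", 3.0)]:
        ms = transfer_seconds(size, gb_per_s) * 1e3
        print(f"{medium}: {ms:.1f} ms")
```

Under these assumptions, the cache for a single 4096-token prompt is 512 MiB, so the choice of transfer medium changes the handoff time by roughly two orders of magnitude. This is one way to see why the benefit of disaggregation can depend on the transfer medium.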

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Cloud Computing and Resource Management