GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving
Qunyou Liu, Darong Huang, Marina Zapater, David Atienza

TL;DR
GreenLLM is an SLO-aware dynamic frequency scaling framework for energy-efficient large language model serving. It reduces GPU energy consumption by up to 34% while maintaining throughput.
Contribution
GreenLLM controls the prefill and decode phases separately, with models and optimization techniques tailored to each stage's characteristics.
Findings
Reduces GPU energy by up to 34% in real trace workloads.
Maintains throughput with less than 3.5% SLO violations.
Effectively manages latency and energy trade-offs in LLM serving.
Abstract
Large Language Models (LLMs) are becoming the backbone of modern cloud services, yet their inference costs are dominated by GPU energy. Unlike traditional GPU workloads, LLM inference has two stages with different characteristics: the prefill phase, which is latency sensitive and scales quadratically with prompt length, and the decode phase, which progresses token by token with unpredictable length. Current GPU power governors (for example, NVIDIA's default) overlook this asymmetry and treat both stages uniformly. The result is mismatched voltage and frequency settings, head-of-line blocking, and excessive energy use. We introduce GreenLLM, an SLO-aware serving framework that minimizes GPU energy by explicitly separating prefill and decode control. At ingress, requests are routed into length-based queues so short prompts avoid head-of-line blocking and TTFT improves. For prefill,…
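The length-based ingress routing mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bucket boundaries, the `LengthRouter` class, and its method names are all hypothetical, and the abstract does not say how many queues GreenLLM uses or where their boundaries lie.

```python
from bisect import bisect_left
from collections import deque

# Hypothetical bucket upper bounds in prompt tokens (not taken from the paper).
BUCKET_BOUNDS = [256, 1024, 4096]


class LengthRouter:
    """Routes requests into per-length FIFO queues so short prompts
    do not wait behind long ones (head-of-line blocking)."""

    def __init__(self, bounds=BUCKET_BOUNDS):
        self.bounds = sorted(bounds)
        # One queue per bucket, plus an overflow queue for longer prompts.
        self.queues = [deque() for _ in range(len(self.bounds) + 1)]

    def bucket_for(self, prompt_tokens):
        # Index of the first bound >= prompt_tokens;
        # len(self.bounds) selects the overflow queue.
        return bisect_left(self.bounds, prompt_tokens)

    def enqueue(self, request_id, prompt_tokens):
        idx = self.bucket_for(prompt_tokens)
        self.queues[idx].append((request_id, prompt_tokens))
        return idx


router = LengthRouter()
router.enqueue("long", 3000)   # lands in the third queue (index 2)
router.enqueue("short", 40)    # lands in the first queue (index 0)
```

Because each bucket has its own queue, the short request above can be scheduled for prefill without waiting for the 3000-token prompt that arrived before it.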
