SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving

Andreas Kosmas Kakolyris; Dimosthenis Masouros; Petros Vavaroutsos; Sotirios Xydis; Dimitrios Soudris

arXiv:2408.05235·cs.DC·December 4, 2025

Open Access

TL;DR

This paper introduces throttLL'eM, a machine-learning-based framework that uses instance and GPU frequency scaling to reduce the energy consumption of LLM inference serving while still meeting Service-Level Objectives (SLOs).

Contribution

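It presents an ML-driven approach to dynamic GPU frequency scaling: the framework projects future KV cache usage and batch size, feeds these projections to a performance model, and uses the predictions to meet SLOs at reduced frequencies and instance sizes.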

Findings

01. Achieves up to 43.8% energy reduction.

02. Improves energy efficiency by at least 1.71×.

03. ML model predicts performance with high accuracy (R^2 > 0.97).

Abstract

As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs places ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL'eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL'eM features mechanisms that project future KV cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL'eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves R^2 scores greater than 0.97 and…
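
The selection step described in the abstract (pick a reduced frequency whose predicted performance still satisfies the SLO) could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the paper's implementation: the function name, the linear scan over frequencies, the use of iterations per second as the SLO metric, and the toy predictor are all hypothetical stand-ins, and the actual model inputs and search strategy are not given in this excerpt.

```python
from typing import Callable, Optional, Sequence


def lowest_frequency_meeting_slo(
    frequencies_mhz: Sequence[int],
    predict_ips: Callable[[int, int, float], float],
    batch_size: int,
    kv_cache_usage: float,
    required_ips: float,
) -> Optional[int]:
    """Return the lowest candidate frequency whose predicted iterations
    per second meets the target, or None if no candidate does.

    predict_ips is a hypothetical stand-in for a trained performance
    model; batch_size and kv_cache_usage are the projected values.
    """
    for freq in sorted(frequencies_mhz):
        if predict_ips(freq, batch_size, kv_cache_usage) >= required_ips:
            return freq
    return None


def toy_predictor(freq_mhz: int, batch_size: int, kv_cache_usage: float) -> float:
    """Made-up predictor for illustration only: throughput grows with
    frequency and shrinks with batch size and KV cache usage."""
    return (freq_mhz / 100.0) / (1.0 + 0.01 * batch_size + 0.5 * kv_cache_usage)


if __name__ == "__main__":
    candidates = [600, 900, 1200, 1410]  # assumed frequency steps in MHz
    chosen = lowest_frequency_meeting_slo(
        candidates, toy_predictor, batch_size=16, kv_cache_usage=0.5,
        required_ips=8.0,
    )
    print(chosen)  # prints 1200
```

Returning None when no candidate satisfies the target leaves the fallback policy to the caller, for example running at the maximum frequency or scaling the instance size.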

Taxonomy

Topics: Advanced Algorithms and Applications · Neural Networks and Applications · Advanced Data Compression Techniques