Characterizing LLM Inference Energy-Performance Tradeoffs across Workloads and GPU Scaling

Paul Joe Maliakel; Shashikant Ilager; Ivona Brandic

arXiv:2501.08219 · cs.LG · February 25, 2026 · 5 cites

Open Access

TL;DR

This paper analyzes how different workloads and GPU scaling affect energy and performance tradeoffs in large language model inference, revealing opportunities for energy savings with minimal latency impact.

Contribution

It provides a measurement-driven analysis of workload heterogeneity and GPU DVFS effects on LLM inference, highlighting the potential for energy-efficient inference strategies.

Findings

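01 The decode phase dominates inference time (77-91%) and is largely insensitive to GPU frequency.

02 Reducing GPU frequency from 2842 MHz to 180 MHz yields an average of 42% energy savings with only a 1-6% latency increase.

03 Lightweight semantic features predict inference difficulty better than input length.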

Abstract

LLM inference exhibits substantial variability across queries and execution phases, yet inference configurations are often applied uniformly. We present a measurement-driven characterization of workload heterogeneity and energy-performance behavior of LLM inference under GPU dynamic voltage and frequency scaling (DVFS). We evaluate five decoder-only LLMs (1B-32B parameters) across four NLP benchmarks using a controlled offline setup. We show that lightweight semantic features predict inference difficulty better than input length, with 44.5% of queries achieving comparable quality across model sizes. At the hardware level, the decode phase dominates inference time (77-91%) and is largely insensitive to GPU frequency. Consequently, reducing GPU frequency from 2842 MHz to 180 MHz achieves an average of 42% energy savings with only a 1-6% latency increase. We further provide a use case with…
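The energy and latency figures above can be related through the identity energy = average power × time. The sketch below is illustrative arithmetic only, not the authors' methodology: the function name `implied_power_ratio` is hypothetical, and pairing the average 42% saving with each end of the 1-6% latency range is a simplifying assumption, since the abstract does not say which latency value corresponds to which saving.

```python
def implied_power_ratio(energy_saving: float, latency_increase: float) -> float:
    """Ratio of average power (low frequency / high frequency) implied by E = P * t.

    energy_saving: fractional energy reduction, e.g. 0.42 for 42%.
    latency_increase: fractional latency growth, e.g. 0.06 for 6%.
    """
    energy_ratio = 1.0 - energy_saving   # E_low / E_high
    time_ratio = 1.0 + latency_increase  # t_low / t_high
    return energy_ratio / time_ratio     # P_low / P_high


# Assumption: combine the reported average saving with both ends of the latency range.
for latency_increase in (0.01, 0.06):
    ratio = implied_power_ratio(0.42, latency_increase)
    print(f"latency +{latency_increase:.0%}: average power ratio {ratio:.3f}")
```

Under these assumptions the implied average power at the lower frequency is roughly 55-57% of the baseline, which suggests the saving comes almost entirely from lower power draw rather than from any change in runtime.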

Taxonomy

Topics: Business Process Modeling and Analysis

Methods: Attention Is All You Need · Cosine Annealing · Adam · Residual Connection · Dropout · Linear Layer · Linear Warmup With Cosine Annealing · Weight Decay · Multi-Head Attention