Characterizing LLM Inference Energy-Performance Tradeoffs across Workloads and GPU Scaling
Paul Joe Maliakel, Shashikant Ilager, Ivona Brandic

TL;DR
This paper analyzes how workload heterogeneity and GPU frequency scaling affect energy-performance tradeoffs in large language model (LLM) inference, revealing opportunities for energy savings with minimal latency impact.
Contribution
It provides a measurement-driven analysis of workload heterogeneity and GPU DVFS effects on LLM inference, highlighting the potential for energy-efficient inference strategies.
Findings
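The decode phase dominates inference time (77-91%) and is largely insensitive to GPU frequency
Reducing GPU frequency from 2842 MHz to 180 MHz yields an average of 42% energy savings with only a 1-6% latency increase
Lightweight semantic features predict inference difficulty better than input length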
Abstract
LLM inference exhibits substantial variability across queries and execution phases, yet inference configurations are often applied uniformly. We present a measurement-driven characterization of workload heterogeneity and energy-performance behavior of LLM inference under GPU dynamic voltage and frequency scaling (DVFS). We evaluate five decoder-only LLMs (1B-32B parameters) across four NLP benchmarks using a controlled offline setup. We show that lightweight semantic features predict inference difficulty better than input length, with 44.5% of queries achieving comparable quality across model sizes. At the hardware level, the decode phase dominates inference time (77-91%) and is largely insensitive to GPU frequency. Consequently, reducing GPU frequency from 2842 MHz to 180 MHz achieves an average of 42% energy savings with only a 1-6% latency increase. We further provide a use case with…
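As a rough illustration of what the reported figures imply, the sketch below uses the identity energy = average power × time to back out the average power draw at the reduced clock relative to the baseline. The function name `implied_power_ratio` is hypothetical, and pairing the 42% average saving with each end of the 1-6% latency range is an assumption made for illustration only; the paper's per-model numbers may differ.

```python
def implied_power_ratio(energy_saving: float, latency_increase: float) -> float:
    """Average power at the reduced clock relative to the baseline clock.

    Uses E = P_avg * t, so
    P_low / P_base = (E_low / E_base) / (t_low / t_base).
    Both arguments are fractions, e.g. 0.42 for a 42% energy saving.
    """
    energy_ratio = 1.0 - energy_saving    # E_low / E_base
    time_ratio = 1.0 + latency_increase   # t_low / t_base
    return energy_ratio / time_ratio


# Illustrative inputs taken from the abstract: 42% average energy saving,
# paired (as an assumption) with the 1% and 6% ends of the latency range.
for latency_increase in (0.01, 0.06):
    ratio = implied_power_ratio(0.42, latency_increase)
    print(f"latency +{latency_increase:.0%}: average power ~{ratio:.1%} of baseline")
```

Under these assumptions the average power at the low clock comes out at roughly 55-57% of the baseline, which shows that the saving comes almost entirely from lower power draw and is only slightly offset by the longer run time.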
Taxonomy
Topics: Business Process Modeling and Analysis
Methods: Attention Is All You Need · Cosine Annealing · Adam · Residual Connection · Dropout · Linear Layer · Linear Warmup With Cosine Annealing · Weight Decay · Multi-Head Attention
