The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
Bole Ma, Ayesha Afzal, Jan Eitzinger, Gerhard Wellein

TL;DR
Power capping never engages during autoregressive decode in LLM serving, because memory-bound decode draws far less than the GPU's power limit. The paper proposes SM clock locking instead, which recovers up to 32% of decode energy at minimal throughput loss across four attention architectures (GQA, MLA, Gated DeltaNet, and Mamba2).
Contribution
Shows that the apparent effect of power capping on LLM decode is illusory, and introduces SM clock locking as a more effective energy lever across four attention architectures.
Findings
Power caps never trigger during decode: the phase is memory-bound, saturates HBM bandwidth rather than compute, and draws only 137 to 300 W on a 700 W NVIDIA H200.
Clock locking can recover up to 32% of decode energy with minimal throughput loss.
A common energy pattern is identified across different attention replacements, showing a heavy prefill cost balanced by efficient decode.
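To make the energy findings concrete, decode energy per token can be written as average power divided by throughput. The sketch below applies that arithmetic to two operating points. The function names and every number in it are hypothetical illustrations, not measurements from the paper.

```python
def energy_per_token(power_w: float, tokens_per_s: float) -> float:
    """Joules per generated token = average power (W) / throughput (tokens/s)."""
    if tokens_per_s <= 0:
        raise ValueError("throughput must be positive")
    return power_w / tokens_per_s


def relative_saving(baseline_j: float, tuned_j: float) -> float:
    """Fraction of per-token energy saved relative to the baseline."""
    return 1.0 - tuned_j / baseline_j


# Hypothetical operating points (illustrative only, not from the paper):
# locking the SM clock lowers power a lot while throughput drops slightly.
baseline = energy_per_token(power_w=300.0, tokens_per_s=100.0)
locked = energy_per_token(power_w=210.0, tokens_per_s=96.0)
saving = relative_saving(baseline, locked)

print(f"baseline: {baseline:.3f} J/token")
print(f"locked:   {locked:.3f} J/token")
print(f"saving:   {saving:.1%}")
```

The point of the arithmetic is that a setting saves energy only when power falls proportionally more than throughput, which is the trade-off the findings attribute to clock locking.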
Abstract
Power capping is the standard GPU energy lever in LLM serving, and it appears to work: throughput drops, power readings fall, and energy budgets are met. We show the appearance is illusory for the phase that dominates production serving: autoregressive decode. Across four attention paradigms (GQA, MLA, Gated DeltaNet, and Mamba2) on NVIDIA H200, decode draws only 137–300 W on a 700 W GPU; no cap ever triggers, because memory-bound decode saturates HBM bandwidth rather than compute and leaves power headroom untouched. Firmware-initiated clock throttling compounds the illusion: these deviations can corrupt any throughput measurement that attributes them to the cap. SM clock locking dissolves both confounds. By targeting the lever that is actually on the critical path, clock locking Pareto-dominates power capping universally, recovering up to 32% of decode energy at minimal…
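The abstract's argument that no cap triggers reduces to a simple comparison: a cap constrains the GPU only when the uncapped draw exceeds it. The sketch below checks the decode power range quoted above (137 to 300 W) against a few cap settings. The cap values and helper names are assumptions chosen for illustration, not the paper's sweep.

```python
def cap_binds(draw_w: float, cap_w: float) -> bool:
    """A power cap only constrains the GPU when the uncapped draw exceeds it."""
    return draw_w > cap_w


def capped_power(draw_w: float, cap_w: float) -> float:
    """Power under an idealised cap: the smaller of the draw and the cap."""
    return min(draw_w, cap_w)


# Decode power range quoted in the abstract for a 700 W H200.
DECODE_DRAW_W = (137.0, 300.0)

# Hypothetical cap settings (assumed for illustration).
CAPS_W = [400.0, 500.0, 600.0, 700.0]

# Caps that would bind for at least one end of the decode range.
binding = [
    cap for cap in CAPS_W
    if any(cap_binds(draw, cap) for draw in DECODE_DRAW_W)
]

for cap in CAPS_W:
    effective = [capped_power(draw, cap) for draw in DECODE_DRAW_W]
    print(f"cap {cap:.0f} W -> decode power {effective} W")
print("binding caps:", binding)
```

Under these assumed settings the list of binding caps is empty, so decode power is identical at every cap. This is an idealised model: it ignores the firmware-initiated clock throttling that the abstract identifies as a separate confound.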
