AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization
Zicong Ye, Kunming Zhang, Guoming Tang

TL;DR
AGFT is an adaptive GPU frequency tuning framework using reinforcement learning to optimize energy efficiency during real-time LLM inference, reducing energy consumption by over 40% with minimal latency impact.
Contribution
We introduce AGFT, a reinforcement learning-based framework that dynamically adjusts GPU frequencies for energy-efficient LLM inference while keeping latency overhead below 10%.
Findings
44.3% GPU energy savings achieved
Less than 10% latency overhead
Up to 40.3% energy-delay product improvement
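The energy-delay product (EDP) named above is the product of energy consumed and latency, so an energy saving can be partly offset by a latency increase. The following is a minimal sketch of that arithmetic; the function name `edp_improvement` and the input values are hypothetical and chosen for illustration only, not taken from the paper's measurements.

```python
def edp_improvement(energy_saving: float, latency_overhead: float) -> float:
    """Relative EDP improvement given fractional energy saving and latency overhead.

    EDP = energy * delay, so the new-to-old EDP ratio is
    (1 - energy_saving) * (1 + latency_overhead).
    """
    energy_ratio = 1.0 - energy_saving
    delay_ratio = 1.0 + latency_overhead
    return 1.0 - energy_ratio * delay_ratio


# Hypothetical operating point: 40% less energy at 5% more latency.
print(f"{edp_improvement(0.40, 0.05):.3f}")  # prints 0.370
```

Headline figures reported as "up to" values need not come from the same operating point, so plugging them into this formula together may not reproduce the reported EDP number.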
Abstract
The explosive growth of interactive Large Language Models (LLMs) has placed unprecedented demands for low latency on cloud GPUs, forcing them into high-power modes and causing escalating energy costs. Real-time inference workloads exhibit significant dynamic volatility, presenting substantial energy-saving opportunities. However, traditional static or rule-based power management strategies struggle to exploit these opportunities without compromising peak performance. To address this challenge, we propose AGFT (An Adaptive GPU Frequency Tuner), a framework that employs online reinforcement learning to autonomously learn an optimal frequency tuning policy. By monitoring real-time features like request load and latency, AGFT utilizes fine-grained frequency control for precise adjustments and intelligent action space pruning for stable, efficient decision-making. This creates a robust,…
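To make the idea of online frequency tuning with action space pruning concrete, here is a minimal sketch. It is not the authors' algorithm: it swaps in a simple epsilon-greedy bandit, a toy simulated GPU (`simulate_step`), and a fixed request load. The frequency levels, power model, pruning rule, and all names are assumptions made for illustration.

```python
import random

# Hypothetical frequency levels in MHz; not taken from the paper.
FREQS_MHZ = [600, 900, 1200, 1500, 1800]


def simulate_step(freq_mhz, load, rng):
    """Toy environment (assumption): returns (energy_joules, latency_seconds)."""
    latency = load / freq_mhz * (1.0 + 0.02 * rng.random())  # small noise
    power = 50.0 + 1e-7 * freq_mhz ** 3  # static power + cubic dynamic term
    return power * latency, latency


class PrunedEpsilonGreedy:
    """Epsilon-greedy bandit that drops clearly inferior frequency levels."""

    def __init__(self, actions, epsilon=0.1, min_pulls=5, prune_margin=0.2, seed=0):
        self.active = list(actions)
        self.epsilon = epsilon
        self.min_pulls = min_pulls
        self.prune_margin = prune_margin
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in actions}
        self.means = {a: 0.0 for a in actions}

    def select(self):
        # Try every active action a few times before trusting its estimate.
        untried = [a for a in self.active if self.counts[a] < self.min_pulls]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.active)
        return max(self.active, key=lambda a: self.means[a])

    def update(self, action, reward):
        self.counts[action] += 1
        # Incremental running mean of the observed reward.
        self.means[action] += (reward - self.means[action]) / self.counts[action]
        self._prune()

    def _prune(self):
        tried = [a for a in self.active if self.counts[a] >= self.min_pulls]
        if len(tried) < 2:
            return
        best = max(self.means[a] for a in tried)
        cutoff = best - self.prune_margin * abs(best)
        # Keep under-sampled actions and those within the margin of the best.
        self.active = [
            a for a in self.active
            if self.counts[a] < self.min_pulls or self.means[a] >= cutoff
        ]


def run(steps=300, seed=0):
    env_rng = random.Random(seed)
    agent = PrunedEpsilonGreedy(FREQS_MHZ, seed=seed)
    for _ in range(steps):
        freq = agent.select()
        energy, latency = simulate_step(freq, load=1000.0, rng=env_rng)
        agent.update(freq, -(energy * latency))  # reward = negative EDP
    return agent


if __name__ == "__main__":
    agent = run()
    print("remaining frequency levels (MHz):", agent.active)
```

In this toy model the lowest and highest frequencies have clearly worse EDP and are pruned, which narrows later exploration to the middle of the range. A real tuner would replace `simulate_step` with measured energy and latency, and would condition its choice on workload features such as request load.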
Taxonomy
Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · Green IT and Sustainability
