EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization
Zhiye Song, Kyungmi Lee, Eun Kyung Lee, Xin Zhang, Tamar Eilam, Anantha P. Chandrakasan

TL;DR
EnergyLens is an end-to-end framework that predicts and optimizes energy consumption for multi-GPU large language model (LLM) inference, helping practitioners choose efficient deployment configurations.
Contribution
It introduces an intuitive einsum-based interface, load-imbalance-aware MoE modeling, and an empirically driven communication energy model that together predict multi-GPU energy behavior, enabling energy-aware optimization without exhaustive profiling.
Findings
Achieves mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for energy prediction.
Reveals up to 1.47x and 52.9x energy variation across configurations.
Identifies Pareto-optimal configurations for compute-communication overlap.
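The two metrics named in the findings, MAPE and Pareto optimality, can be illustrated with a minimal Python sketch. This is not EnergyLens code: the function names `mape` and `pareto_front`, the choice of latency and energy as the two objectives, and all numbers are hypothetical and chosen only to make the definitions concrete.

```python
from typing import List, Sequence, Tuple


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent, of predictions vs. measurements."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


def pareto_front(configs: List[Tuple[str, float, float]]) -> List[str]:
    """Return names of configurations not dominated in (latency, energy).

    A configuration is dominated if another one is no worse in both
    objectives and strictly better in at least one (lower is better).
    """
    front = []
    for name, lat, energy in configs:
        dominated = any(
            (l2 <= lat and e2 <= energy) and (l2 < lat or e2 < energy)
            for _, l2, e2 in configs
        )
        if not dominated:
            front.append(name)
    return front


# Made-up energy measurements and predictions (joules), for illustration only.
measured_j = [100.0, 200.0]
predicted_j = [110.0, 180.0]
print(f"MAPE: {mape(measured_j, predicted_j):.2f}%")

# Made-up (name, latency in ms, energy in J) tuples for four hypothetical configs.
candidates = [
    ("a", 1.0, 5.0),
    ("b", 2.0, 3.0),
    ("c", 2.5, 4.0),  # dominated by "b": slower and more energy
    ("d", 3.0, 1.0),
]
print("Pareto-optimal:", pareto_front(candidates))
```

In this toy data, configuration "c" is excluded because "b" is better on both axes, while "a", "b", and "d" each represent a different latency-energy trade-off.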
Abstract
We present EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. As LLMs scale, predicting and reducing their energy footprint has become critical for sustainability and datacenter operations, yet existing approaches either require production-level code and expensive profiling or fail to accurately capture multi-GPU energy behavior. As a result, practitioners lack tools for deciding which optimizations to prioritize and for selecting among existing deployment configurations when exhaustive profiling is impractical. EnergyLens addresses this gap with an intuitive einsum-based interface that captures LLM specifications including fusion, parallelism, and compute-communication overlap, combined with load-imbalance-aware MoE modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3…
