DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

Jovan Stojkovic; Chaojie Zhang; Íñigo Goiri; Josep Torrellas; Esha Choukse

arXiv:2408.00741 · cs.AI · October 1, 2025 · 5 cites

Open Access

TL;DR

DynamoLLM is a novel energy-management framework that dynamically optimizes large language model inference clusters for energy efficiency and cost savings while maintaining performance SLOs.

Contribution

It introduces the first framework for automatic, dynamic reconfiguration of LLM inference clusters to improve energy efficiency and reduce costs.

Findings

01 Reduces energy consumption by 53% and carbon emissions by 38%.

02 Reduces operational costs for LLM inference by 61%.

03 Maintains latency SLOs despite the energy optimizations.

Abstract

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume a large amount of energy and, consequently, to produce excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and the fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space in which different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To…
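The search space mentioned in the abstract can be illustrated with a small brute-force sketch. This is not DynamoLLM's algorithm, which the truncated abstract does not describe. It only shows how a configuration (number of instances, model parallelism, GPU frequency) might be chosen to minimize power while meeting a latency SLO. The `Config` class, the `toy_profile` cost model, and every numeric constant are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Config:
    instances: int        # number of model instances
    tensor_parallel: int  # GPUs per instance (model parallelism degree)
    gpu_freq_mhz: int     # GPU core frequency


def toy_profile(cfg: Config, load_qps: float) -> Tuple[float, float]:
    """Hypothetical cost model returning (latency in s, power in W).

    Assumption: latency falls with GPU count and frequency, and
    power rises with both. Real systems would use measured profiles.
    """
    gpus = cfg.instances * cfg.tensor_parallel
    freq_ghz = cfg.gpu_freq_mhz / 1000.0
    latency_s = load_qps / (10.0 * gpus * freq_ghz)
    power_w = gpus * (100.0 + 0.2 * cfg.gpu_freq_mhz)
    return latency_s, power_w


def pick_config(
    load_qps: float,
    slo_s: float,
    profile: Callable[[Config, float], Tuple[float, float]] = toy_profile,
) -> Optional[Config]:
    """Return the lowest-power configuration that meets the SLO, or None."""
    best, best_power = None, float("inf")
    # Exhaustively enumerate a small, made-up configuration grid.
    for n, tp, f in product((1, 2, 4), (1, 2, 4, 8), (800, 1200, 1600)):
        cfg = Config(n, tp, f)
        latency_s, power_w = profile(cfg, load_qps)
        if latency_s <= slo_s and power_w < best_power:
            best, best_power = cfg, power_w
    return best


if __name__ == "__main__":
    # The cheapest feasible configuration shifts as the load changes.
    for qps in (5.0, 20.0):
        print(qps, pick_config(load_qps=qps, slo_s=1.0))
```

Under this toy model, a load of 5 queries per second is served by one GPU at the lowest frequency, while 20 queries per second needs two GPUs at a higher frequency. That is the kind of load-dependent trade-off the abstract points to. An exhaustive scan like this only works for a tiny grid; it says nothing about how a production system would search a larger space.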

Taxonomy

Topics: Advanced Data Storage Technologies · Neural Networks and Applications · Parallel Computing and Optimization Techniques