DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse

TL;DR
DynamoLLM is a novel energy-management framework that dynamically optimizes large language model inference clusters for energy efficiency and cost savings while maintaining performance SLOs.
Contribution
It introduces the first framework for automatic, dynamic reconfiguration of LLM inference clusters to improve energy efficiency and reduce costs.
Findings
Cuts energy consumption by 53% and carbon emissions by 38%.
Reduces operational costs for LLM inference by 61%.
Maintains latency SLOs despite energy optimizations.
Abstract
The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume a large amount of energy and, consequently, produce excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and the fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To…
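To make the search space mentioned in the abstract concrete, here is a minimal sketch of choosing among configurations of (number of instances, model parallelism, GPU frequency) under a latency SLO. It is not DynamoLLM's algorithm. The latency and power formulas, the constant `MAX_FREQ_MHZ`, the candidate values, and all names (`Config`, `toy_latency_s`, `toy_power_w`, `pick_config`) are assumptions invented for illustration; a real system would presumably rely on measured profiles instead.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Optional

MAX_FREQ_MHZ = 1980  # assumed maximum GPU clock; illustrative only


@dataclass(frozen=True)
class Config:
    instances: int            # number of model instances
    tensor_parallelism: int   # GPUs per instance
    gpu_freq_mhz: int         # GPU clock frequency


def toy_latency_s(cfg: Config, load_rps: float) -> float:
    """Made-up latency model: more load per instance is slower,
    more parallelism and higher frequency are faster."""
    per_instance_load = load_rps / cfg.instances
    speed = cfg.tensor_parallelism ** 0.8 * (cfg.gpu_freq_mhz / MAX_FREQ_MHZ)
    return 0.5 * (1.0 + per_instance_load) / speed


def toy_power_w(cfg: Config) -> float:
    """Made-up power model: static power per GPU plus a term
    that grows with the cube of the relative frequency."""
    gpus = cfg.instances * cfg.tensor_parallelism
    rel_freq = cfg.gpu_freq_mhz / MAX_FREQ_MHZ
    return gpus * (100.0 + 600.0 * rel_freq ** 3)


def all_configs() -> list[Config]:
    """Enumerate a small hypothetical search space."""
    return [
        Config(i, tp, f)
        for i, tp, f in product((1, 2, 4), (2, 4, 8), (1200, 1600, 1980))
    ]


def pick_config(
    load_rps: float, slo_s: float, configs: Iterable[Config]
) -> Optional[Config]:
    """Return the lowest-power config whose modeled latency meets the SLO,
    or None if no candidate is feasible."""
    feasible = [c for c in configs if toy_latency_s(c, load_rps) <= slo_s]
    return min(feasible, key=toy_power_w) if feasible else None


if __name__ == "__main__":
    for load in (2.0, 8.0):
        print(load, pick_config(load, slo_s=1.0, configs=all_configs()))
```

Exhaustive enumeration works here only because the toy space has 27 candidates. Under this toy model, the configuration picked at a lower load never draws more power than the one picked at a higher load, which illustrates why reconfiguring as load fluctuates can save energy.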
Taxonomy
Topics: Advanced Data Storage Technologies · Neural Networks and Applications · Parallel Computing and Optimization Techniques
