Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads
Grant Wilkins, Srinivasan Keshav, and Richard Mortier

TL;DR
This paper proposes a hybrid data center model with a dynamic scheduling framework that allocates LLM tasks to different hardware based on workload, reducing CPU+GPU energy consumption by 7.5% on a representative LLM dataset.
Contribution
It introduces a workload-aware hybrid scheduling strategy that optimizes energy efficiency in LLM inference workloads across heterogeneous hardware.
Findings
Reduces CPU+GPU energy consumption by 7.5% with the hybrid approach.
Uses workload-aware task allocation based on input/output tokens.
Demonstrates energy savings on a representative LLM dataset.
Abstract
Both the training and use of Large Language Models (LLMs) require large amounts of energy. Their increasing popularity therefore raises critical concerns regarding the energy efficiency and sustainability of the data centers that host them. This paper addresses the challenge of reducing energy consumption in data centers running LLMs. We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate LLM tasks across hardware accelerators that differ in their energy efficiencies and computational capabilities. Specifically, our workload-aware strategy determines whether tasks are processed on energy-efficient processors or high-performance GPUs based on the number of input and output tokens in a query. Our analysis of a representative LLM dataset finds that this hybrid strategy can reduce CPU+GPU energy consumption by 7.5% compared to a…
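The token-based routing described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear per-token energy and runtime models, the fixed per-query overhead, the `energy_weight` parameter, the names `System`, `cost`, and `schedule`, and all numeric coefficients, which are placeholders rather than measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    """Hypothetical hardware profile; all coefficients are placeholders."""
    name: str
    fixed_joules: float        # assumed per-query energy overhead
    joules_per_input: float    # assumed energy per input token
    joules_per_output: float   # assumed energy per output token
    seconds_per_input: float   # assumed runtime per input token
    seconds_per_output: float  # assumed runtime per output token

    def energy(self, n_in: int, n_out: int) -> float:
        return (self.fixed_joules
                + self.joules_per_input * n_in
                + self.joules_per_output * n_out)

    def runtime(self, n_in: int, n_out: int) -> float:
        return self.seconds_per_input * n_in + self.seconds_per_output * n_out


def cost(system: System, n_in: int, n_out: int, energy_weight: float = 0.5) -> float:
    """Weighted sum of energy and runtime (unnormalized; an assumed cost form)."""
    return (energy_weight * system.energy(n_in, n_out)
            + (1.0 - energy_weight) * system.runtime(n_in, n_out))


def schedule(systems, n_in: int, n_out: int, energy_weight: float = 0.5) -> System:
    """Route a query to the system with the lowest cost for its token counts."""
    return min(systems, key=lambda s: cost(s, n_in, n_out, energy_weight))


# Placeholder profiles: an energy-efficient processor and a high-performance GPU.
SYSTEMS = [
    System("efficient", 1.0, 0.05, 0.5, 0.002, 0.05),
    System("gpu", 20.0, 0.02, 0.3, 0.0002, 0.01),
]

if __name__ == "__main__":
    for n_in, n_out in [(8, 8), (2000, 500)]:
        chosen = schedule(SYSTEMS, n_in, n_out, energy_weight=1.0)
        print(f"{n_in} input / {n_out} output tokens -> {chosen.name}")
```

With these placeholder numbers, short queries go to the efficient processor because the GPU's assumed fixed overhead dominates, while long queries go to the GPU because its lower assumed per-token energy outweighs that overhead. Real crossover points would depend on measured hardware profiles.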
Taxonomy
Topics: Advanced Data Storage Technologies · Privacy-Preserving Technologies in Data
