Coordinated Cooling and Compute Management for AI Datacenters

Nardos Belay Abera; Yize Chen

arXiv:2601.08113·eess.SY·January 14, 2026

Open Access

TL;DR

This paper introduces a joint cooling and compute management framework for AI datacenters. It co-optimizes GPU compute settings and cooling while respecting thermal constraints, with the goal of improving energy efficiency during large-scale AI inference.

Contribution

The paper develops a hierarchical control framework that co-optimizes GPU compute parameters and cooling strategies, using models of both the workload and the thermal behavior.
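
To make the idea of co-optimizing compute and cooling concrete, here is a minimal toy sketch. It is not the paper's method: it replaces the hierarchical controller with a flat grid search over two knobs (a GPU frequency and a cooling supply temperature), and every model form and constant below (cubic power law, linear thermal resistance, linear coefficient of performance) is a hypothetical assumption chosen only for illustration.

```python
from itertools import product

# All models and constants below are hypothetical placeholders,
# not values or equations taken from the paper.

def gpu_power_w(freq_ghz, idle_w=100.0, k=60.0):
    """Assumed GPU power model: idle power plus a cubic frequency term."""
    return idle_w + k * freq_ghz ** 3

def latency_s(freq_ghz, work=1.5):
    """Assumed latency model: fixed work divided by clock frequency."""
    return work / freq_ghz

def gpu_temp_c(supply_c, power_w, r_th=0.12):
    """Assumed steady-state thermal model: supply temperature plus R_th * power."""
    return supply_c + r_th * power_w

def cooling_power_w(power_w, supply_c, c0=2.0, c1=0.1):
    """Assumed cooling model: heat removed divided by a COP that rises with supply temperature."""
    return power_w / (c0 + c1 * supply_c)

def best_setting(freqs, supplies, max_latency_s, max_temp_c):
    """Grid-search the (frequency, supply temperature) pair with the lowest
    total power that meets both the latency and the temperature limit.
    Returns (total_power_w, freq_ghz, supply_c), or None if nothing is feasible."""
    best = None
    for f, t in product(freqs, supplies):
        p = gpu_power_w(f)
        if latency_s(f) > max_latency_s or gpu_temp_c(t, p) > max_temp_c:
            continue  # violates a latency or thermal constraint
        total = p + cooling_power_w(p, t)
        if best is None or total < best[0]:
            best = (total, f, t)
    return best

result = best_setting(
    freqs=[1.0, 1.2, 1.4, 1.6, 1.8],
    supplies=[18, 22, 26, 30],
    max_latency_s=1.2,
    max_temp_c=60.0,
)
print(result)
```

In this toy setup the two knobs interact: the latency limit rules out the lowest frequencies, and the temperature limit rules out the warmest (cheapest) supply temperature, so the selected pair is a compromise that neither a compute-only nor a cooling-only search would be guaranteed to find.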

Findings

01 Improves energy efficiency of AI datacenters during inference.

02 Balances latency and thermal constraints effectively.

03 Demonstrates benefits using real Azure inference traces.

Abstract

AI datacenters are being deployed at large scale to support the training and deployment of power-intensive large language models (LLMs). The extensive computation and cooling these datacenters require raise concerns about their energy use and carbon emissions. Although prior work has examined the energy efficiency of LLM inference, most of it focuses on optimizing compute-side scheduling without considering thermal objectives or constraints. Since GPU-intensive inference generates substantial heat that can degrade datacenter performance, ignoring thermal effects can increase total energy consumption and reduce the efficiency of LLM serving. To fill this gap, we profile the characteristics of GPU servers under varying cooling conditions and AI workloads, and develop a joint cooling and computing modeling approach for AI datacenters. Built…

Taxonomy

Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · IoT and Edge/Fog Computing