On the Role of DAG Topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach
Anas Hattay, Fred Ngole Mboula, Eric Gascard, and Zakaria Yahoun

TL;DR
This paper investigates the limitations of GNN-based deep reinforcement learning schedulers in energy-aware cloud workflows, revealing how structural mismatches cause performance failures under distribution shifts.
Contribution
It identifies specific out-of-distribution conditions that cause GNN-based schedulers to fail and explains the underlying reasons for these failures.
Findings
GNN-based schedulers degrade under distribution shifts due to structural mismatches.
Performance issues are linked to disrupted message passing in GNNs.
The analysis highlights the need for more robust representations for reliable scheduling.
Abstract
Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights…
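To make the link between DAG structure and message passing concrete, here is a minimal sketch in plain Python. It is not the authors' model: the function `mean_message_pass`, the 0.5/0.5 mixing weights, the scalar node features, and the two toy graphs are all assumptions chosen for illustration. It shows that identical node features produce different node embeddings once the edges change, which is the kind of structural mismatch the abstract attributes the failures to.

```python
from collections import defaultdict


def mean_message_pass(num_nodes, edges, features, rounds=2):
    """Toy message passing over a DAG (illustrative, not the paper's GNN).

    In each round, a node with predecessors replaces its value with the
    average of its own value and the mean of its predecessors' values.
    Nodes without predecessors keep their value.
    """
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    h = list(features)
    for _ in range(rounds):
        new_h = []
        for v in range(num_nodes):
            if preds[v]:
                incoming = sum(h[u] for u in preds[v]) / len(preds[v])
                new_h.append(0.5 * h[v] + 0.5 * incoming)
            else:
                new_h.append(h[v])
        h = new_h  # synchronous update: all nodes read the previous round
    return h


# Same hypothetical per-task feature (e.g. a workload estimate) in both graphs.
features = [8.0, 0.0, 0.0, 0.0]

chain = [(0, 1), (1, 2), (2, 3)]    # deep, sequential workflow
fan_out = [(0, 1), (0, 2), (0, 3)]  # shallow, parallel workflow

chain_h = mean_message_pass(4, chain, features)
fan_h = mean_message_pass(4, fan_out, features)

print("chain  :", chain_h)
print("fan-out:", fan_h)
```

After two rounds, information from the source task has reached only part of the chain, while every task in the fan-out graph has already received it. A policy trained on embeddings from one family of topologies would therefore see differently distributed inputs on the other, even though the raw task features are the same.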
