GrapheonRL: A Graph Neural Network and Reinforcement Learning Framework for Constraint and Data-Aware Workflow Mapping and Scheduling in Heterogeneous HPC Systems
Aasish Kumar Sharma, Julian Kunkel

TL;DR
GrapheonRL introduces a novel framework combining Graph Neural Networks and Reinforcement Learning to improve workflow scheduling in heterogeneous HPC systems, achieving near-optimal solutions with significantly reduced computation time.
Contribution
It presents a scalable, adaptive scheduling approach that handles dynamic constraints and workload complexities better than traditional ILP and heuristic methods.
Findings
76% faster than ILP-based solutions
Comparable to heuristic methods in execution time, only 3.85 times slower than the OLB heuristic
Effectively adapts to different workflows and system constraints
Abstract
Effective resource utilization and decreased makespan in heterogeneous High Performance Computing (HPC) environments are key benefits of workload mapping and scheduling. Tools such as Snakemake, a workflow management solution, employ Integer Linear Programming (ILP) and heuristic techniques to deploy workflows in various HPC environments like SLURM (Simple Linux Utility for Resource Management) or Kubernetes. Its scheduler factors in workflow task dependencies, resource requirements, and individual task data sizes before system deployment. ILP offers optimal solutions respecting constraints, but only for smaller workflows. Meanwhile, meta-heuristics and heuristics offer faster, though suboptimal, makespan. As problem sizes, system constraints, and complexities evolve, maintaining these schedulers becomes challenging. In this study, we propose a novel solution that integrates Graph…
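To make the scheduling problem in the abstract concrete, here is a minimal sketch of a greedy heuristic of the kind it contrasts with ILP: tasks are visited in dependency order and each goes to the feasible node where it would finish earliest. This is not GrapheonRL's method and not Snakemake's scheduler. The function name `greedy_schedule`, the task and node tuples, and the rule that each node runs one task at a time are all assumptions made for illustration, and data transfer sizes are ignored.

```python
from graphlib import TopologicalSorter


def greedy_schedule(tasks, deps, nodes):
    """Greedy earliest-finish-time mapping (illustrative sketch only).

    tasks: {task: (work_units, cores_needed)}   -- hypothetical format
    deps:  {task: set of predecessor tasks}
    nodes: {node: (speed, cores_available)}     -- hypothetical format
    Assumption: each node executes one task at a time.
    """
    graph = {t: deps.get(t, ()) for t in tasks}
    finish = {}                               # task -> finish time
    node_free = {n: 0.0 for n in nodes}       # node -> time it becomes idle
    placement = {}                            # task -> chosen node

    for t in TopologicalSorter(graph).static_order():
        work, cores = tasks[t]
        # A task may start only after all of its predecessors have finished.
        ready = max((finish[p] for p in graph[t]), default=0.0)
        best = None
        for n, (speed, ncores) in nodes.items():
            if ncores < cores:                # resource constraint not met
                continue
            end = max(ready, node_free[n]) + work / speed
            if best is None or end < best[0]:
                best = (end, n)
        if best is None:
            raise ValueError(f"no node can run task {t!r}")
        finish[t], placement[t] = best
        node_free[best[1]] = best[0]

    return placement, finish, max(finish.values())


# Toy workflow: "a" must finish before "b" and "c" can start.
tasks = {"a": (4.0, 1), "b": (6.0, 2), "c": (2.0, 1)}
deps = {"b": {"a"}, "c": {"a"}}
nodes = {"cpu": (1.0, 2), "gpu": (2.0, 4)}

placement, finish, makespan = greedy_schedule(tasks, deps, nodes)
print(placement, makespan)
```

The result depends on the order in which ready tasks are visited, which is one reason such heuristics are fast but can miss the optimal makespan that an ILP formulation would find on small workflows.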
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
