SLA-MORL: SLA-Aware Multi-Objective Reinforcement Learning for HPC Resource Optimization

Seraj Al Mahmud Mostafa; Aravind Mohan; Jianwu Wang

arXiv:2508.03509 · cs.LG · August 6, 2025

TL;DR

SLA-MORL is an adaptive multi-objective reinforcement learning framework that optimizes HPC resource allocation for machine learning workloads, balancing training time, cost, and SLA compliance through intelligent initialization and dynamic weight adjustment.

Contribution

The paper introduces an RL-based approach that shortens the cold-start phase and adapts objective priorities in real time for SLA-aware resource management in HPC environments.

Findings

01 Achieves a 67.2% reduction in training time for deadline-critical jobs

02 Reduces operational costs by 68.8% for budget-constrained workloads

03 Improves SLA compliance by 73.4% over static methods

Abstract

Dynamic resource allocation for machine learning workloads in cloud environments remains challenging due to competing objectives of minimizing training time and operational costs while meeting Service Level Agreement (SLA) constraints. Traditional approaches employ static resource allocation or single-objective optimization, leading to either SLA violations or resource waste. We present SLA-MORL, an adaptive multi-objective reinforcement learning framework that intelligently allocates GPU and CPU resources based on user-defined preferences (time, cost, or balanced) while ensuring SLA compliance. Our approach introduces two key innovations: (1) intelligent initialization through historical learning or efficient baseline runs that eliminates cold-start problems, reducing initial exploration overhead by 60%, and (2) dynamic weight adaptation that automatically adjusts optimization…
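
To make the idea of preference-weighted objectives and dynamic weight adaptation concrete, here is a minimal Python sketch. It is not the authors' implementation: the preset weight values, the `Sla` container, the normalization by SLA limits, and the fixed `step` update rule are all assumptions chosen for illustration. The abstract names the three preference modes (time, cost, balanced) but does not give their numeric weights or the adaptation rule.

```python
from dataclasses import dataclass

# Hypothetical (time, cost) weight presets. The source names the modes
# but does not state their values.
PRESETS = {
    "time": (0.7, 0.3),
    "cost": (0.3, 0.7),
    "balanced": (0.5, 0.5),
}


@dataclass
class Sla:
    """Assumed SLA limits: a training deadline and a spending budget."""
    deadline_hours: float
    budget_dollars: float


def scalarized_reward(time_hours, cost_dollars, sla, weights):
    """Collapse two objectives into one reward via a weighted sum.

    Each objective is divided by its SLA limit, so a value of 1.0 means
    "exactly at the limit". The sum is negated because lower time and
    lower cost should yield a higher reward.
    """
    w_time, w_cost = weights
    return -(w_time * time_hours / sla.deadline_hours
             + w_cost * cost_dollars / sla.budget_dollars)


def adapt_weights(weights, time_hours, cost_dollars, sla, step=0.1):
    """Shift weight toward the objective that overshoots its SLA limit more.

    This is an assumed update rule, not the one from the paper. Weights
    are renormalized so they always sum to 1.
    """
    w_time, w_cost = weights
    time_over = max(0.0, time_hours / sla.deadline_hours - 1.0)
    cost_over = max(0.0, cost_dollars / sla.budget_dollars - 1.0)
    if time_over > cost_over:
        w_time += step
    elif cost_over > time_over:
        w_cost += step
    total = w_time + w_cost
    return (w_time / total, w_cost / total)
```

For example, with `Sla(deadline_hours=10.0, budget_dollars=100.0)` and the balanced preset, a run that takes 12 hours and costs 80 dollars misses only the deadline, so `adapt_weights` raises the time weight above 0.5 for the next allocation decision. A run that meets both limits leaves the weights unchanged.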
