Intelligent Resource Scheduling for Co-located Latency-critical Services: A Multi-Model Collaborative Learning Approach

Lei Liu

arXiv:1911.13208 · cs.DC · September 7, 2022 · 1 cites

Open Access

TL;DR

This paper introduces OSML, a multi-model machine learning-based scheduler that improves resource allocation for co-located latency-critical services by avoiding resource cliffs and enhancing QoS stability.

Contribution

It presents a novel collaborative ML approach that predicts QoS variations and intelligently guides resource scheduling to improve efficiency and stability in cloud environments.

Findings

01 Supports higher load levels with QoS guarantees

02 Reduces scheduling overhead and convergence time

03 Effectively avoids resource cliffs during scheduling

Abstract

Latency-critical services are widely deployed in cloud environments. For cost efficiency, multiple services are usually co-located on a single server, so run-time resource scheduling becomes pivotal for QoS control in these complicated co-location cases. However, the scheduling exploration space grows rapidly as server resources increase, making it hard for schedulers to find ideal solutions quickly. More importantly, we observe that there are "resource cliffs" in the scheduling exploration space. They affect exploration efficiency and always lead to severe QoS fluctuations, and previous schedulers cannot easily avoid them. To address these problems, we propose OSML, a novel ML-based intelligent scheduler. It learns the correlation between architectural hints (e.g., IPC, cache misses, memory footprint), scheduling solutions and the QoS demands…
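
To make the notion of a "resource cliff" concrete, the following is a minimal sketch under stated assumptions, not OSML itself and not code from the paper. It assumes a cliff is an allocation where giving up a single unit of one resource multiplies latency by a large factor. The latency model `toy_latency_ms`, its thresholds, the `ratio` parameter, and the helper `find_cliffs` are all hypothetical and exist only for illustration. The sketch scans a small (cores, cache ways) grid exhaustively, which is exactly the kind of exploration that becomes expensive as the space grows.

```python
from typing import Callable, List, Tuple

Allocation = Tuple[int, int]  # (cores, cache_ways)


def toy_latency_ms(cores: int, cache_ways: int) -> float:
    """Hypothetical latency model; the numbers are invented, not measured."""
    # Assumption: below 4 cores or 6 cache ways the service thrashes
    # and latency jumps to a flat, much higher value.
    if cores < 4 or cache_ways < 6:
        return 50.0
    # Otherwise latency improves smoothly as resources are added.
    return 5.0 + 8.0 / cores + 6.0 / cache_ways


def find_cliffs(
    latency: Callable[[int, int], float],
    max_cores: int,
    max_ways: int,
    ratio: float = 3.0,
) -> List[Allocation]:
    """Return allocations where removing one core or one cache way
    raises latency by at least `ratio` times (an assumed cliff test)."""
    cliffs: List[Allocation] = []
    for cores in range(1, max_cores + 1):
        for ways in range(1, max_ways + 1):
            here = latency(cores, ways)
            # Compare against the two neighbours with one unit less.
            for d_cores, d_ways in ((-1, 0), (0, -1)):
                n_cores, n_ways = cores + d_cores, ways + d_ways
                if n_cores < 1 or n_ways < 1:
                    continue  # neighbour is outside the grid
                if latency(n_cores, n_ways) >= ratio * here:
                    cliffs.append((cores, ways))
                    break  # one steep neighbour is enough
    return cliffs


cliffs = find_cliffs(toy_latency_ms, max_cores=8, max_ways=10)
print(cliffs)
```

In this toy model the flagged allocations lie along the edge of the thrashing region. A scheduler that steps down one unit at a time from such a point would see the sudden QoS drop that the abstract describes.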

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Software System Performance and Reliability