SOL: Safe On-Node Learning in Cloud Platforms

Yawen Wang; Daniel Crankshaw; Neeraja J. Yadwadkar; Daniel Berger; Christos Kozyrakis; Ricardo Bianchini

arXiv:2201.10477 · cs.OS · January 26, 2022

TL;DR

This paper introduces SOL, a framework enabling safe, on-node machine learning for cloud platform agents, improving their performance while ensuring robustness against failures.

Contribution

We propose SOL, an extensible API and system for deploying safe, robust ML-based agents in cloud nodes, demonstrated through three practical agent implementations.

Findings

01 ML improves agent performance in managing resources

02 SOL ensures agent safety under failure conditions

03 ML-based agents have significant potential in cloud management

Abstract

Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Azure, and identify the classes of agents that are most likely to benefit from on-node ML. We then propose SOL, an extensible framework for designing ML-based agents that are safe and robust to the range of failure conditions that occur in production. SOL provides a simple API to agent developers and manages the scheduling and running of the agent-specific functions they write. We illustrate the use of SOL by implementing three ML-based agents that manage CPU cores, node power, and memory…
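
The pattern the abstract describes, a framework that runs developer-written agent functions while keeping the node safe when the ML path fails, can be illustrated with a minimal Python sketch. Every name below (SafeAgentRunner, predict_cores, the validate callback, the default of 8 cores) is hypothetical and chosen for illustration only; none of it is taken from SOL's actual API, which this page does not describe.

```python
from typing import Callable, Optional


class SafeAgentRunner:
    """Hypothetical runner: calls a learned policy, falls back to a static default."""

    def __init__(
        self,
        predict: Callable[[float], float],
        validate: Callable[[float], bool],
        default_action: float,
    ) -> None:
        self.predict = predict
        self.validate = validate
        self.default_action = default_action

    def step(self, observation: Optional[float]) -> float:
        # Missing or invalid data: do not trust the model, use the safe default.
        if observation is None or not self.validate(observation):
            return self.default_action
        try:
            return self.predict(observation)
        except Exception:
            # A failure in the developer-written function also triggers the fallback.
            return self.default_action


def predict_cores(cpu_utilization: float) -> float:
    """Toy stand-in for a learned model: more utilization, more cores."""
    if cpu_utilization > 0.95:
        raise RuntimeError("simulated model failure")
    return float(round(cpu_utilization * 8))


runner = SafeAgentRunner(
    predict=predict_cores,
    validate=lambda u: 0.0 <= u <= 1.0,
    default_action=8.0,
)
```

In this sketch, runner.step(0.5) uses the toy model, while a missing reading, an out-of-range reading, or an exception inside predict_cores all yield the static default. The design choice is that the fallback lives in the runner rather than in each agent, so an agent author only writes the prediction and validation functions.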

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Blockchain Technology Applications and Security