Reinforcement Learning-Based Dynamic Management of Structured Parallel Farm Skeletons on Serverless Platforms
Lanpei Li, Massimo Coppola, Malio Li, Valerio Besozzi, Jack Bell, Vincenzo Lomonaco

TL;DR
This paper introduces a reinforcement learning framework for dynamic management of structured parallel skeletons on serverless platforms, aiming to improve performance and resilience while maintaining programmability.
Contribution
It presents a novel RL-based autoscaling approach for farm patterns on serverless platforms, combining monitoring, control, and learning for optimized resource management.
Findings
RL policies outperform reactive management on QoS metrics
AI-driven scaling adapts better to platform limitations
Improved resource efficiency and stable scaling behavior
Abstract
We present a framework for dynamic management of structured parallel processing skeletons on serverless platforms. Our goal is to bring HPC-like performance and resilience to serverless and continuum environments while preserving the programmability benefits of skeletons. As a first step, we focus on the well-known Farm pattern and its implementation on the open-source OpenFaaS platform, treating autoscaling of the worker pool as a QoS-aware resource management problem. The framework couples a reusable farm template with a Gymnasium-based monitoring and control layer that exposes queue, timing, and QoS metrics to both reactive and learning-based controllers. We investigate the effectiveness of AI-driven dynamic scaling for managing the farm's degree of parallelism via the scalability of serverless functions on OpenFaaS. In particular, we discuss the autoscaling model and its training,…
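As a rough illustration of the control layer the abstract describes, the sketch below simulates a farm's task queue and exposes it through a reset/step interface modelled on Gymnasium's conventions. It is not the authors' implementation: the class `ToyFarmEnv`, the functions `reactive_policy` and `run_episode`, the arrival and service rates, the queue target, and the reward shape are all hypothetical, and the code uses only the standard library rather than Gymnasium or OpenFaaS.

```python
import random


class ToyFarmEnv:
    """Toy farm simulation (hypothetical; not the paper's environment).

    Mimics the Gymnasium calling convention: reset() returns (obs, info),
    step() returns (obs, reward, terminated, truncated, info).
    """

    def __init__(self, min_workers=1, max_workers=8, mean_arrivals=5,
                 service_rate=2, queue_target=10, horizon=50, seed=0):
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.mean_arrivals = mean_arrivals  # assumed mean tasks per tick
        self.service_rate = service_rate    # assumed tasks per worker per tick
        self.queue_target = queue_target    # assumed QoS bound on queue length
        self.horizon = horizon
        self.seed = seed

    def reset(self):
        self.rng = random.Random(self.seed)
        self.workers = self.min_workers
        self.queue = 0
        self.t = 0
        return (self.queue, self.workers), {}

    def step(self, action):
        # action: -1 removes a worker, 0 holds, +1 adds a worker (clamped)
        self.workers = max(self.min_workers,
                           min(self.max_workers, self.workers + action))
        self.queue += self.rng.randint(0, 2 * self.mean_arrivals)
        served = min(self.queue, self.workers * self.service_rate)
        self.queue -= served
        self.t += 1
        violation = self.queue > self.queue_target
        # Penalise QoS violations and, more mildly, each active worker.
        reward = -(1.0 if violation else 0.0) - 0.05 * self.workers
        truncated = self.t >= self.horizon
        info = {"served": served, "qos_violation": violation}
        return (self.queue, self.workers), reward, False, truncated, info


def reactive_policy(obs, high=10, low=2):
    """Threshold rule: scale out on a long queue, scale in on a short one."""
    queue, _workers = obs
    if queue > high:
        return 1
    if queue < low:
        return -1
    return 0


def run_episode(env, policy):
    """Run one episode and return the accumulated reward."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


if __name__ == "__main__":
    print(run_episode(ToyFarmEnv(), reactive_policy))
```

Because both the threshold rule and any learned policy only need to map an observation to an action, a learning-based controller could be swapped in for `reactive_policy` without changing the loop in `run_episode`.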
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Software System Performance and Reliability
