START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks

Shreshth Tuli; Sukhpal Singh Gill; Peter Garraghan; Rajkumar Buyya; Giuliano Casale; Nicholas R. Jennings

arXiv:2111.10241 · cs.DC · November 22, 2021

TL;DR

This paper introduces START, a proactive straggler prediction and mitigation method using Encoder LSTM networks, which improves cloud job response times and reduces SLA violations by analyzing resource consumption patterns.

Contribution

The paper presents a novel Encoder LSTM-based approach for proactive straggler prediction and dynamic scheduling in cloud environments, outperforming existing reactive and prediction-based methods.

Findings

01. START reduces execution time by 13%

02. START decreases SLA violations by 19%

03. START lowers energy consumption by 16%

Abstract

Modern large-scale computing systems split jobs into multiple smaller tasks that execute in parallel to accelerate job completion and reduce energy consumption. However, a common performance problem in such systems is straggler tasks: slow-running instances that increase the overall response time. Such tasks can significantly impact the system's Quality of Service (QoS) and its Service Level Agreements (SLAs). To combat this issue, there is a need for automatic straggler detection and mitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that first detect and then mitigate straggler tasks, which leads to delays. Other works use prediction-based proactive mechanisms, but ignore heterogeneous host or volatile task characteristics. In this paper, we propose a Straggler Prediction and…
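To illustrate why a single straggler matters, here is a minimal Python sketch of the problem the abstract describes. It is not the paper's method: the function names, the task durations, and the 1.5x-median threshold are all hypothetical choices made for illustration. It only shows that a job split into parallel tasks finishes when its slowest task does, and that a simple reactive rule can flag the slow task only after its runtime is known.

```python
from statistics import median


def job_response_time(task_times):
    """A job finishes only when its slowest parallel task finishes."""
    return max(task_times)


def find_stragglers(task_times, slowdown=1.5):
    """Return indices of tasks slower than `slowdown` times the median.

    The 1.5x-median cutoff is an illustrative assumption, not the
    detection rule used by START.
    """
    cutoff = slowdown * median(task_times)
    return [i for i, t in enumerate(task_times) if t > cutoff]


# Hypothetical task durations in seconds; the last task is the straggler.
task_times = [10.0, 11.0, 9.0, 10.0, 30.0]

print(job_response_time(task_times))      # response time with the straggler
print(job_response_time(task_times[:4]))  # response time without it
print(find_stragglers(task_times))        # indices flagged as stragglers
```

A threshold rule like this needs observed runtimes, so it can only react once a task is already slow. That delay is the limitation of reactive approaches the abstract points to, and it is what a proactive predictor aims to avoid.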
