START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks
Shreshth Tuli, Sukhpal Singh Gill, Peter Garraghan, Rajkumar Buyya, Giuliano Casale, Nicholas R. Jennings

TL;DR
This paper introduces START, a proactive straggler prediction and mitigation method using Encoder LSTM networks, which improves cloud job response times and reduces SLA violations by analyzing resource consumption patterns.
Contribution
The paper presents a novel Encoder LSTM-based approach for proactive straggler prediction and dynamic scheduling in cloud environments, outperforming existing reactive and prediction-based methods.
Findings
START reduces execution time by 13%
START decreases SLA violations by 19%
START lowers energy consumption by 16%
Abstract
Modern large-scale computing systems split jobs into multiple smaller tasks that execute in parallel to accelerate job completion and reduce energy consumption. However, a common performance problem in such systems is straggler tasks: slow-running instances that increase the overall response time. Such tasks can significantly impact the system's Quality of Service (QoS) and Service Level Agreements (SLAs). To combat this issue, there is a need for automatic straggler detection and mitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that first detect and then mitigate straggler tasks, which leads to delays. Other works use prediction-based proactive mechanisms but ignore heterogeneous host or volatile task characteristics. In this paper, we propose a Straggler Prediction and…
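To illustrate why a single straggler matters, the minimal Python sketch below treats a job's response time as the duration of its slowest parallel task and flags tasks that run far longer than their peers. The function names, the sample durations, and the 1.5x-median cutoff are all assumptions made for illustration; this is a simple reactive heuristic, not the Encoder LSTM predictor that START uses.

```python
from statistics import median


def job_response_time(task_durations):
    """A job completes only when its slowest parallel task completes."""
    return max(task_durations)


def find_stragglers(task_durations, threshold=1.5):
    """Return indices of tasks slower than `threshold` times the median.

    The median-based cutoff is an illustrative assumption, not START's method.
    """
    cutoff = threshold * median(task_durations)
    return [i for i, d in enumerate(task_durations) if d > cutoff]


# Hypothetical task run times in seconds for one job.
durations = [10.0, 11.0, 9.5, 10.5, 42.0]
print(job_response_time(durations))  # the slowest task sets the job's response time
print(find_stragglers(durations))    # indices of tasks flagged as stragglers
```

A heuristic like this can only flag a straggler after the task has already run slowly, which is the delay that the abstract attributes to reactive approaches.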
