ATLAS: An Adaptive Failure-aware Scheduler for Hadoop
Mbarka Soualhia, Foutse Khomh, Sofiene Tahar

TL;DR
ATLAS is an adaptive failure-aware scheduler for Hadoop that predicts and mitigates task failures in cloud environments, significantly improving job success rates and reducing execution time and resource usage.
Contribution
This paper introduces ATLAS, a novel scheduler that dynamically predicts failures and adapts scheduling decisions in Hadoop to enhance reliability and efficiency in cloud settings.
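To make the idea of failure-aware placement concrete, here is a minimal sketch in Python. It is not the ATLAS algorithm, which this summary does not specify. The names Node, pick_node, predict_failure, and max_risk are hypothetical, and the predictor is assumed to be any function that returns a failure probability for a given task and node.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical worker node with a count of free task slots."""
    name: str
    free_slots: int


def pick_node(
    task_id: str,
    nodes: Sequence[Node],
    predict_failure: Callable[[str, Node], float],
    max_risk: float = 0.5,
) -> Optional[Node]:
    """Return the free node with the lowest predicted failure risk.

    Returns None when no node has a free slot or when even the safest
    node exceeds max_risk, so the caller can defer the task and retry.
    """
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        return None
    # Score each candidate once, then take the lowest-risk one.
    scored = [(predict_failure(task_id, n), n) for n in candidates]
    risk, best = min(scored, key=lambda pair: pair[0])
    return best if risk <= max_risk else None


if __name__ == "__main__":
    # Toy predictor: fixed per-node risk (an assumption for illustration).
    toy_risk = {"a": 0.7, "b": 0.2, "c": 0.05}
    cluster = [Node("a", 1), Node("b", 1), Node("c", 0)]
    chosen = pick_node("task-1", cluster, lambda t, n: toy_risk[n.name])
    print(chosen.name if chosen else "deferred")
```

In this sketch the scheduler defers a task rather than placing it on a node it expects to fail. A real scheduler would also have to weigh data locality and queue fairness, which are omitted here.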
Findings
Reduces failed jobs by up to 28%.
Decreases failed tasks by up to 39%.
Reduces total job execution time by 10 minutes on average.
Abstract
Hadoop has become the de facto standard for processing large data in today's cloud environment. The performance of Hadoop in the cloud has a direct impact on many important applications, ranging from web analytics, web indexing, and image and document processing to high-performance scientific computing. However, because of the scale, complexity, and dynamic nature of the cloud, failures are common, and these failures often degrade the performance of jobs running in Hadoop. Although Hadoop has built-in failure detection and recovery mechanisms, several scheduled jobs still fail because of unforeseen events in the cloud environment. A single task failure can cause the whole job to fail and make job running times unpredictable. In this report, we propose ATLAS (AdapTive faiLure-Aware Scheduler), a new scheduler for Hadoop that can adapt its scheduling decisions to events occurring in the…
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · IoT and Edge/Fog Computing
