ATLAS: An Adaptive Failure-aware Scheduler for Hadoop

Mbarka Soualhia; Foutse Khomh; Sofiene Tahar

arXiv:1511.01446 · cs.DC · November 4, 2016 · 2 cites

TL;DR

ATLAS is an adaptive failure-aware scheduler for Hadoop that predicts and mitigates task failures in cloud environments, significantly improving job success rates and reducing execution time and resource usage.

Contribution

This paper introduces ATLAS, a novel scheduler that dynamically predicts failures and adapts scheduling decisions in Hadoop to enhance reliability and efficiency in cloud settings.

Findings

01 Reduces failed jobs by up to 28%.

02 Decreases failed tasks by up to 39%.

03 Speeds up job completion by 10 minutes on average.

Abstract

Hadoop has become the de facto standard for processing large data in today's cloud environment. The performance of Hadoop in the cloud has a direct impact on many important applications, ranging from web analytics, web indexing, and image and document processing to high-performance scientific computing. However, because of the scale, complexity, and dynamic nature of the cloud, failures are common, and these failures often impact the performance of jobs running in Hadoop. Although Hadoop possesses built-in failure detection and recovery mechanisms, several scheduled jobs still fail because of unforeseen events in the cloud environment. A single task failure can cause the whole job to fail and can make job running times unpredictable. In this report, we propose ATLAS (AdapTive faiLure-Aware Scheduler), a new scheduler for Hadoop that can adapt its scheduling decisions to events occurring in the…

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · IoT and Edge/Fog Computing