A parallel workload has extreme variability

R. Henwood; N. W. Watkins; S. C. Chapman; R. McLay

arXiv:1611.04167 · cs.DC · November 21, 2016 · 1 citation

Open Access

TL;DR

This paper investigates why identical parallel tasks take widely varying times to complete in HPC and cloud environments, and proposes a generalized extreme value (GEV) model that accounts for the tail behavior and may apply broadly.

Contribution

It introduces a GEV-based model for variability in parallel workloads and checks it against measurements taken on the public cloud, suggesting a general property of such systems.

Findings

01. GEV distribution naturally models extreme variability

02. Real-world cloud data aligns well with the GEV model

03. Implications for performance characterization and anomaly detection
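
For reference, the GEV distribution named in the findings has a standard textbook cumulative distribution function, given below. The symbols for location (\mu), scale (\sigma > 0) and shape (\xi) are assumptions for illustration; they do not appear on this page, and the paper's own notation and sign convention may differ.

```latex
G(x;\mu,\sigma,\xi) =
\begin{cases}
\exp\!\left(-\left[1+\xi\,\dfrac{x-\mu}{\sigma}\right]^{-1/\xi}\right),
  & \xi \neq 0,\quad 1+\xi\,\dfrac{x-\mu}{\sigma} > 0,\\[6pt]
\exp\!\left(-\exp\!\left(-\dfrac{x-\mu}{\sigma}\right)\right),
  & \xi = 0.
\end{cases}
```

The \xi = 0 case is the Gumbel distribution; positive \xi gives a heavier upper tail, which is the regime usually associated with long-tailed completion times.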

Abstract

In both high-performance computing (HPC) environments and the public cloud, the time taken to retrieve or save your results is simultaneously unpredictable and important to your overall resource budget. It is generally accepted ("Google: Taming the Long Latency Tail - When More Machines Equals Worse Results", Todd Hoff, highscalability.com 2012), but without a robust explanation, that identical parallel tasks take different durations to complete -- a phenomenon known as variability. This paper advances understanding of this topic. We carefully choose a model from which system-level complexity emerges that can be studied directly. We find that a generalized extreme value (GEV) model for variability naturally emerges. Using the public cloud, we find real-world observations have excellent agreement with our model. Since the GEV distribution is a limit distribution, this suggests a…
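
The abstract does not spell out the model it studies, so the following is only a minimal sketch of why an extreme value distribution can arise, under two loudly stated assumptions that are not taken from the paper: a workload finishes when its slowest task finishes, and task durations are independent and exponentially distributed. Under those assumptions the workload duration is a maximum, and classical extreme value theory predicts it approaches a Gumbel distribution (the GEV with shape 0) located at ln(n) for n tasks. All function and variable names here are hypothetical.

```python
import math
import random


def simulate_workload_durations(n_tasks, n_runs, seed=0):
    """Simulate n_runs workloads; each lasts as long as its slowest task.

    Assumption (not from the paper): task durations are i.i.d. Exp(1).
    """
    rng = random.Random(seed)
    return [
        max(rng.expovariate(1.0) for _ in range(n_tasks))
        for _ in range(n_runs)
    ]


def gumbel_cdf(x, loc=0.0, scale=1.0):
    """CDF of the Gumbel distribution, i.e. the GEV with shape 0."""
    return math.exp(-math.exp(-(x - loc) / scale))


def max_cdf_gap(durations, loc):
    """Largest distance between the empirical CDF and the Gumbel CDF."""
    xs = sorted(durations)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        g = gumbel_cdf(x, loc)
        # Compare against the empirical CDF just before and just after x.
        gap = max(gap, abs((i + 1) / n - g), abs(i / n - g))
    return gap


N_TASKS, N_RUNS = 256, 5000
durations = simulate_workload_durations(N_TASKS, N_RUNS)

# Theory for the maximum of n Exp(1) variables: location ln(n), scale 1.
loc = math.log(N_TASKS)
gap = max_cdf_gap(durations, loc)

# The mean of a standard Gumbel is the Euler-Mascheroni constant (~0.5772).
mean_excess = sum(durations) / N_RUNS - loc

print(f"max CDF gap vs Gumbel: {gap:.4f}")
print(f"mean duration minus ln(n): {mean_excess:.4f}")
```

A small CDF gap and a mean excess near 0.5772 indicate that the simulated durations follow the limit law closely. Swapping the exponential for a heavier-tailed task distribution would move the limit to a GEV with positive shape, which is one reason a fitted GEV shape parameter is a convenient summary of tail behavior.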

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Theoretical and Computational Physics