NURD: Negative-Unlabeled Learning for Online Datacenter Straggler Prediction
Yi Ding, Avinash Rao, Hyebin Song, Rebecca Willett, Henry Hoffmann

TL;DR
NURD introduces a negative-unlabeled learning method for early prediction of datacenter task stragglers without relying on complete labels or distribution assumptions, improving prediction accuracy and job completion times.
Contribution
The paper proposes NURD, a novel negative-unlabeled learning framework that predicts stragglers using only negative and unlabeled data, eliminating the need for labeled positive examples or distribution assumptions.
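The negative-unlabeled setting can be illustrated with a minimal Python sketch. This is not NURD's implementation; it only shows why an observer partway through a job has negatives and unlabeled tasks but no confirmed positives. The threshold-based straggler definition, the function name `split_negative_unlabeled`, and the latency values are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def split_negative_unlabeled(
    true_latency: Dict[str, float],
    now: float,
    straggler_threshold: float,
) -> Tuple[List[str], List[str]]:
    """Split tasks into known negatives and unlabeled tasks at time `now`.

    Assumption (not from the source): a straggler is a task whose latency
    exceeds `straggler_threshold`. `true_latency` is simulation-only ground
    truth; a live system would observe a latency only once a task finishes.
    """
    if now > straggler_threshold:
        # Past the threshold, still-running tasks are already known stragglers,
        # which is outside the early-prediction window this sketch covers.
        raise ValueError("sketch only covers now <= straggler_threshold")

    negatives: List[str] = []
    unlabeled: List[str] = []
    for task_id, latency in true_latency.items():
        if latency <= now:
            # Finished before the threshold: a confirmed non-straggler.
            negatives.append(task_id)
        else:
            # Still running: may or may not become a straggler.
            unlabeled.append(task_id)
    return negatives, unlabeled


# Hypothetical latencies in seconds; task "c" is the eventual straggler.
latencies = {"a": 10.0, "b": 12.0, "c": 95.0, "d": 11.0}
negatives, unlabeled = split_negative_unlabeled(
    latencies, now=15.0, straggler_threshold=50.0
)
print(negatives)
print(unlabeled)
```

At `now=15.0` every finished task is a confirmed non-straggler, and the only unfinished task cannot yet be labeled either way. A predictor in this setting therefore has to be trained without positive examples.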
Findings
NURD improves F1 score by 2-11 percentage points over baselines.
NURD reduces job completion time by 2.0-8.8 percentage points relative to baselines.
These results are measured on datacenter traces from Google and Alibaba.
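Because the findings are stated in percentage points of F1 score, a short worked example may help. The sketch below computes F1 from confusion-matrix counts and expresses the gap between two predictors in percentage points. The counts and the names `f1_score` and `percentage_point_gain` are hypothetical and are not taken from the paper's results.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def percentage_point_gain(f1_new: float, f1_old: float) -> float:
    """Absolute difference between two F1 scores, in percentage points."""
    return (f1_new - f1_old) * 100.0


# Hypothetical counts for straggler prediction (straggler = positive class).
f1_candidate = f1_score(tp=8, fp=2, fn=2)  # 16 / 20
f1_baseline = f1_score(tp=6, fp=2, fn=2)   # 12 / 16
gain = percentage_point_gain(f1_candidate, f1_baseline)
print(f"{f1_candidate:.2f} vs {f1_baseline:.2f}: {gain:.1f} percentage points")
```

A percentage-point gain is an absolute difference between two scores, so moving from 0.75 to 0.80 is a 5 percentage point gain rather than a 5 percent relative improvement.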
Abstract
Datacenters execute large computational jobs, which are composed of smaller tasks. A job completes when all its tasks finish, so stragglers -- rare, yet extremely slow tasks -- are a major impediment to datacenter performance. Accurately predicting stragglers would enable proactive intervention, allowing datacenter operators to mitigate stragglers before they delay a job. While much prior work applies machine learning to predict computer system performance, these approaches rely on complete labels -- i.e., sufficient examples of all possible behaviors, including straggling and non-straggling -- or strong assumptions about the underlying latency distributions -- e.g., whether Gaussian or not. Within a running job, however, none of this information is available until stragglers have revealed themselves when they have already delayed the job. To predict stragglers accurately and early…
Taxonomy
Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Advanced Neural Network Applications
