Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

Jonathan Will; Onur Arslan; Jonathan Bader; Dominik Scheinert; Lauritz Thamsen

arXiv:2111.07904·cs.DC·March 14, 2022

TL;DR

This paper explores data reduction techniques to minimize training data size for performance models of cloud data analytics jobs, achieving significant efficiency gains with minimal impact on accuracy.

Contribution

It introduces clustering-based data reduction methods that significantly cut training data size while maintaining model accuracy for cloud dataflow job performance prediction.

Findings

1. 75% data reduction with only 1% increase in prediction error

2. Effective data transfer and storage savings achieved

3. Clustering techniques maintain model accuracy with less data

Abstract

Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and number is difficult, especially for users who do not have access to previous performance metrics. One approach to overcoming this issue is to have users share runtime metrics to train context-aware performance models that help find a suitable configuration for the job at hand. A problem when sharing runtime data instead of trained models or model parameters is that the data size can grow substantially over time. This paper examines several clustering techniques to minimize training data size while keeping the associated performance models accurate. Our results indicate that efficiency gains in data transfer, storage, and model training can be achieved…
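The general idea of clustering-based training data reduction can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the specific clustering techniques, so plain k-means is swapped in here as one plausible choice. The record layout (scale-out, input size in GB, runtime in seconds), the synthetic data, and the function names `kmeans` and `reduce_training_data` are all hypothetical. The sketch groups similar runtime records and keeps only the record closest to each cluster center.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the previous centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def reduce_training_data(records, keep_fraction=0.25):
    """Keep roughly keep_fraction of the records: one representative per cluster."""
    k = max(1, round(len(records) * keep_fraction))
    centroids = kmeans(records, k)
    # For each centroid, retain the real record nearest to it.
    # A set is used because two centroids may share the same nearest record.
    kept = {min(records, key=lambda r: math.dist(r, c)) for c in centroids}
    return sorted(kept)


# Hypothetical runtime records: (scale-out, input size in GB, runtime in s).
noise = random.Random(42)
records = [
    (float(n), float(gb), 100.0 * gb / n + noise.uniform(0, 5))
    for n in (2, 4, 8, 16)
    for gb in (10, 20, 40, 80)
]

reduced = reduce_training_data(records, keep_fraction=0.25)
print(f"kept {len(reduced)} of {len(records)} records")
```

A performance model would then be trained on `reduced` instead of `records`. In practice the features would likely need to be normalized before computing distances, since raw runtime values can dominate the other dimensions.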
