Characterizing and Subsetting Big Data Workloads

Zhen Jia, Jianfeng Zhan, Lei Wang, Rui Han, Sally A. McKee, Qiang Yang, Chunjie Luo, and Jingwei Li

arXiv:1409.0792·cs.PF·November 15, 2016

TL;DR

This paper uses PCA and clustering to analyze and select representative workloads from BigDataBench, facilitating more efficient and diverse benchmarking of big data systems.

Contribution

It introduces a methodology combining PCA and clustering to characterize and subset big data workloads, improving benchmark diversity and efficiency.

Findings

01 Identified key workload characteristics using PCA

02 Clustered workloads based on principal components

03 Selected seven representative workloads for benchmarking

Abstract

Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates these challenges. In this paper, we first use Principal Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principal components obtained from…
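The two steps named in the abstract (PCA over per-workload metrics, then clustering in the reduced space) can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not name the clustering algorithm, so k-means is an assumption here, as is picking the member nearest each cluster center as the representative. The workload names, the four metric columns, and every function name below are hypothetical; the paper itself works from 45 metrics.

```python
import math
import random


def standardize(rows):
    """Z-score every metric (column) so that no metric dominates by scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def principal_axes(rows, k, iters=300, seed=0):
    """Top-k covariance eigenvectors via power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        lam = 0.0
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            lam = math.sqrt(sum(x * x for x in w))  # eigenvalue estimate
            if lam < 1e-12:
                break
            v = [x / lam for x in w]
        axes.append(v)
        # Deflate: remove the component just found before the next pass.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= lam * v[i] * v[j]
    return axes


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering technique)."""
    rng = random.Random(seed)
    centers = [p[:] for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centers


def subset_workloads(names, metrics, n_components=2, k=2):
    """Return cluster labels and one representative workload per cluster."""
    z = standardize(metrics)
    axes = principal_axes(z, n_components)
    points = [[sum(a * b for a, b in zip(r, ax)) for ax in axes] for r in z]
    labels, centers = kmeans(points, k)
    reps = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            best = min(members, key=lambda i: dist2(points[i], centers[c]))
            reps.append(names[best])
    return labels, sorted(reps)


# Hypothetical toy data: six workloads, four made-up metrics each.
names = ["w0", "w1", "w2", "w3", "w4", "w5"]
metrics = [
    [1.0, 2.0, 10.0, 0.5],
    [1.1, 2.1, 10.5, 0.6],
    [0.9, 1.9, 9.5, 0.4],
    [5.0, 8.0, 1.0, 3.0],
    [5.2, 8.1, 1.2, 3.1],
    [4.8, 7.9, 0.8, 2.9],
]
labels, representatives = subset_workloads(names, metrics)
print(labels, representatives)
```

Standardizing before PCA matters because raw metrics typically sit on very different scales; without it, the largest-magnitude metric would dominate the leading components. In practice a library routine for PCA and clustering would replace the hand-rolled versions, which are written out here only to keep the sketch dependency-free.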
