Building Wavelet Histograms on Large Data in MapReduce

Jeffrey Jestes; Ke Yi; Feifei Li

arXiv:1110.6649·cs.DB·November 1, 2011


Open Access

TL;DR

This paper presents new algorithms for constructing exact and approximate wavelet histograms on large datasets in MapReduce, reducing both end-to-end running time and communication cost compared with straightforward adaptations of existing methods.

Contribution

The paper introduces new algorithms for exact and approximate wavelet histogram construction designed for MapReduce, and evaluates their efficiency by end-to-end running time and communication cost.

Findings

01. Significant reduction in computation time and communication costs.

02. Order-of-magnitude performance improvements over baseline methods.

03. Effective implementation in Hadoop with large real and synthetic datasets.

Abstract

MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact, accurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate that straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet…
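For readers unfamiliar with the term, the sketch below illustrates what a wavelet histogram is on a single machine. It uses the standard textbook construction: apply an orthonormal Haar transform to a frequency vector and keep the k coefficients of largest magnitude. It is not the MapReduce algorithms from the paper, which the abstract above does not describe. The function names (`haar_transform`, `wavelet_histogram`, `reconstruct`), the sign convention for detail coefficients, and the requirement that the input length be a power of two are all assumptions made for illustration.

```python
import math


def haar_transform(v):
    """Orthonormal Haar transform of a sequence whose length is a power of two.

    Assumed layout: index 0 holds the overall (scaled) average, followed by
    detail coefficients from coarsest to finest resolution.
    """
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(x) for x in v]
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (averages) and differences (details).
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:length] = avg + diff
        length = half
    return out


def wavelet_histogram(v, k):
    """Keep the k coefficients of largest magnitude as {index: value}."""
    coeffs = haar_transform(v)
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return {i: coeffs[i] for i in ranked[:k]}


def reconstruct(hist, n):
    """Approximate the original length-n vector from the retained coefficients."""
    out = [0.0] * n
    for i, c in hist.items():
        out[i] = c
    length = 2
    while length <= n:
        half = length // 2
        merged = []
        for i in range(half):
            a, d = out[i], out[half + i]
            # Invert the pairwise sum/difference step of the forward transform.
            merged += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        out[:length] = merged
        length *= 2
    return out


if __name__ == "__main__":
    freq = [3, 5, 10, 8, 2, 2, 10, 14]  # hypothetical frequency vector
    hist = wavelet_histogram(freq, k=3)
    print(hist)
    print(reconstruct(hist, len(freq)))
```

Because the transform is orthonormal, keeping the k largest-magnitude coefficients minimizes the squared reconstruction error among all choices of k coefficients, which is why this selection rule is the usual definition of the histogram.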


Taxonomy

Topics: Cloud Computing and Resource Management · Graph Theory and Algorithms · Data Management and Algorithms