Pangea: Monolithic Distributed Storage for Data Analytics

Jia Zou; Arun Iyengar; Chris Jermaine

arXiv:1808.06094 · cs.DC · December 18, 2018 · 1 citation

TL;DR

Pangea is a unified monolithic storage system designed for data analytics that simplifies data management, reduces redundancy, and improves performance compared to layered systems like Spark and HDFS.

Contribution

The paper introduces a single, integrated storage system that manages all data, both intermediate and long-lived, together with buffering and caching, data placement optimization, and failure recovery, without any layering.

Findings

01 Pangea's performance is comparable or superior to that of layered systems like Spark.

02 Reduces data copying and management overhead.

03 Improves resource utilization in data analytics workloads.

Abstract

Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and non-shared execution data in separate systems: distributed file systems like HDFS, in-memory file systems like Alluxio, and computation frameworks like Spark. Such layering introduces significant performance and management costs, both for copying data redundantly across layers and for deciding on a proper resource allocation for each layer. In this paper we propose a single system called Pangea that manages all data (both intermediate and long-lived), along with its buffering and caching, data placement optimization, and failure recovery, in one monolithic storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.

Taxonomy

Topics: Advanced Data Storage Technologies · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques