Shark: SQL and Rich Analytics at Scale

Reynold Xin; Josh Rosen; Matei Zaharia; Michael J. Franklin; Scott; Shenker; Ion Stoica

arXiv:1211.6176·cs.DB·November 28, 2012·19 cites

Shark: SQL and Rich Analytics at Scale

Reynold Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott, Shenker, Ion Stoica

PDF

Open Access

TL;DR

Shark is a scalable data analysis system that combines SQL query processing with complex analytics, achieving significant speedups over Hadoop and Hive while maintaining fault tolerance and flexibility.

Contribution

It introduces a unified engine with a novel distributed memory abstraction enabling fast SQL and analytics at scale, combining MPP performance with MapReduce-like fault tolerance.

Findings

01

SQL queries up to 100x faster than Hive

02

Machine learning programs up to 100x faster than Hadoop

03

Achieves speedups similar to MPP databases with fault tolerance

Abstract

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100x faster than Apache Hive, and machine learning programs up to 100x faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsData Quality and Management · Cloud Computing and Resource Management · Distributed systems and fault tolerance