Shark: SQL and Rich Analytics at Scale
Reynold Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott, Shenker, Ion Stoica

TL;DR
Shark is a scalable data analysis system that combines SQL query processing with complex analytics, achieving significant speedups over Hadoop and Hive while maintaining fault tolerance and flexibility.
Contribution
It introduces a unified engine with a novel distributed memory abstraction enabling fast SQL and analytics at scale, combining MPP performance with MapReduce-like fault tolerance.
Findings
SQL queries up to 100x faster than Hive
Machine learning programs up to 100x faster than Hadoop
Achieves speedups similar to MPP databases with fault tolerance
Abstract
Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100x faster than Apache Hive, and machine learning programs up to 100x faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsData Quality and Management · Cloud Computing and Resource Management · Distributed systems and fault tolerance
