Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Jesús Camacho-Rodríguez; Ashutosh Chauhan; Alan Gates; Eugene Koifman; Owen O'Malley; Vineet Garg; Zoltan Haindrich; Sergey Shelukhin; Prasanth Jayachandran; Siddharth Seth; Deepak Jaiswal; Slim Bouguerra; Nishant Bangarwa; Sankar Hariappan; Anishek Agarwal; Jason Dere; Daniel Dai; Thejas Nair; Nita Dembla; Gopal Vijayaraghavan; Günther Hagleitner

arXiv:1903.10970·cs.DB·March 27, 2019·6 cites

TL;DR

This paper details the evolution of Apache Hive from a batch processing tool to an enterprise-grade data warehousing system, highlighting innovations in architecture, optimization, and performance.

Contribution

It introduces a hybrid architecture combining MPP and cloud concepts, along with enhancements in transactions, optimizer, runtime, and federation for big data analytics.

Findings

1. Demonstrates improved performance for typical workloads
2. Shows scalability with hybrid architecture
3. Details enhancements in system components

Abstract

Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.

Taxonomy

Topics: Cloud Computing and Resource Management · Advanced Database Systems and Queries · Data Quality and Management