Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing
Jesús Camacho-Rodríguez, Ashutosh Chauhan, Alan Gates, Eugene Koifman, Owen O'Malley, Vineet Garg, Zoltan Haindrich, Sergey Shelukhin, Prasanth Jayachandran, Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason Dere

TL;DR
This paper details the evolution of Apache Hive from a batch processing tool to an enterprise-grade data warehousing system, highlighting innovations in architecture, optimization, and performance.
Contribution
It introduces a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts, along with enhancements to transactions, the optimizer, the runtime, and federation for big data analytics.
Findings
Demonstrates improved performance for typical workloads
Shows scalability with hybrid architecture
Details enhancements in system components
Abstract
Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Database Systems and Queries · Data Quality and Management
