DualTable: A Hybrid Storage Model for Update Optimization in Hive
Songlin Hu, Wantao Liu, Tilmann Rabl, Shuo Huang, Ying Liang, Zheng Xiao, Hans-Arno Jacobsen, Xubin Pei, Jiye Wang

TL;DR
DualTable is a hybrid storage model that combines HDFS and HBase to optimize update operations in Hive, significantly improving data manipulation performance while maintaining query efficiency.
Contribution
The paper introduces DualTable, a novel hybrid storage model that enhances update and delete performance in Hive without sacrificing query speed.
Findings
Hive on DualTable is up to 10 times faster than standard Hive for update and delete operations.
DualTable combines the efficient streaming reads of HDFS with the random-write capability of HBase.
Experiments demonstrate significant performance improvements over traditional Hive storage.
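One way to picture the combination of streaming reads and random writes is the minimal sketch below. It is an illustration under stated assumptions, not the paper's implementation: the class name `DualTableSketch`, the `(row_id, value)` schema, and the in-memory list and dict standing in for HDFS and HBase are all hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Row = Tuple[int, str]  # hypothetical schema: (row_id, value)


class DualTableSketch:
    """Toy model: a list stands in for the HDFS-backed data (sequential scans),
    a dict stands in for the HBase-backed store (random writes by key)."""

    def __init__(self, master_rows: List[Row]) -> None:
        # Bulk data; never modified in place by update() or delete().
        self.master: List[Row] = list(master_rows)
        # row_id -> new value; None marks a deleted row (assumed convention).
        self.delta: Dict[int, Optional[str]] = {}

    def update(self, row_id: int, value: str) -> None:
        # Random write: only the small keyed store is touched.
        self.delta[row_id] = value

    def delete(self, row_id: int) -> None:
        # Random write: record a delete marker instead of rewriting bulk data.
        self.delta[row_id] = None

    def scan(self) -> Iterator[Row]:
        # Streaming read: one sequential pass over the bulk data,
        # overlaying any recorded updates and skipping deleted rows.
        for row_id, value in self.master:
            if row_id not in self.delta:
                yield (row_id, value)
                continue
            new_value = self.delta[row_id]
            if new_value is not None:
                yield (row_id, new_value)


table = DualTableSketch([(1, "a"), (2, "b"), (3, "c")])
table.update(2, "B")
table.delete(3)
print(list(table.scan()))
```

In this sketch an update or delete costs one keyed write rather than a rewrite of the bulk data, while a query still reads the bulk data in a single sequential pass.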
Abstract
Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems, such as those in Smart Grid applications, usually require complicated business logic and involve many data manipulation operations like updates and deletes. Hive cannot offer sufficient support for these while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently, and Hive on HBase suffers from poor query performance even though it can support faster data manipulation. There is a project based on Hive issue HIVE-5317 to support update operations, but it has not been finished in Hive's latest version. Since this ACID compliant…
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
