Research on the efficiency of data loading and storage in Data Lakehouse architectures for the formation of analytical data systems
Ivan Borodii, Halyna Osukhivska

TL;DR
This study compares the performance and storage efficiency of Apache Hudi, Apache Iceberg, and Delta Lake Data Lakehouse systems using Apache Spark for structured and semi-structured data, guiding architecture selection.
Contribution
It provides the first performance and storage comparison of these three Data Lakehouse systems, aiding data engineers in architecture choice based on data volume and performance needs.
Findings
Delta Lake offers the fastest data loading across data types.
Apache Iceberg provides better storage efficiency and stability.
Apache Hudi is less effective for loading and storage but may excel in streaming scenarios.
Abstract
The paper studies the efficiency of loading and storing data in the three most common Data Lakehouse systems (Apache Hudi, Apache Iceberg, and Delta Lake), using Apache Spark as the distributed data processing platform. It analyzes how each system behaves when processing structured (CSV) and semi-structured (JSON) data of different sizes, including files of up to 7 GB. The goal is to determine the most suitable Data Lakehouse architecture for building analytical data systems, based on the type and volume of the data sources, data loading performance with Apache Spark, and the on-disk size of the stored data. The research covers the development of four sequential ETL processes, each of which reads, transforms, and loads data into tables in each of the Data Lakehouse systems. The efficiency of each Lakehouse was evaluated according to two key…
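To make the two quantities named in the abstract concrete (loading performance and on-disk data size), here is a minimal Python sketch of how a single load could be timed and its resulting table directory measured. This is not the authors' benchmark code. The names `measure_load`, `directory_size_bytes`, and `fake_load` are hypothetical, and `fake_load` is only a stand-in for a real Spark write into a Hudi, Iceberg, or Delta Lake table.

```python
import os
import tempfile
import time
from typing import Callable, Tuple


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def measure_load(load_fn: Callable[[str], None], table_path: str) -> Tuple[float, int]:
    """Run load_fn(table_path) once; return (elapsed seconds, bytes on disk)."""
    start = time.perf_counter()
    load_fn(table_path)
    elapsed = time.perf_counter() - start
    return elapsed, directory_size_bytes(table_path)


def fake_load(table_path: str) -> None:
    """Hypothetical stand-in for a Spark write: creates one 1 KiB file."""
    os.makedirs(table_path, exist_ok=True)
    with open(os.path.join(table_path, "part-0000.bin"), "wb") as f:
        f.write(b"x" * 1024)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        seconds, size = measure_load(fake_load, os.path.join(tmp, "table"))
        print(f"load took {seconds:.4f} s, table occupies {size} bytes")
```

In an actual comparison, `load_fn` would wrap the Spark job that reads the CSV or JSON source, transforms it, and writes it to the chosen table format. The same measurement would then be repeated for each system and each input size.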
