Research on the efficiency of data loading and storage in Data Lakehouse architectures for the formation of analytical data systems

Ivan Borodii; Halyna Osukhivska

arXiv:2604.21449·cs.DC·April 24, 2026


TL;DR

This study compares the data loading performance and storage efficiency of three Data Lakehouse systems (Apache Hudi, Apache Iceberg, and Delta Lake) on Apache Spark, using structured (CSV) and semi-structured (JSON) data, to guide architecture selection.

Contribution

It provides the first performance and storage comparison of these three Data Lakehouse systems, helping data engineers choose an architecture based on data volume and performance needs.

Findings

01 Delta Lake offers the fastest data loading across data types.

02 Apache Iceberg provides better storage efficiency and stability.

03 Apache Hudi is less effective for loading and storage but may excel in streaming scenarios.

Abstract

The paper presents a study of the efficiency of loading and storing data in the three most common Data Lakehouse systems, namely Apache Hudi, Apache Iceberg, and Delta Lake, using Apache Spark as the distributed data processing platform. The study analyzes the behavior of each system when processing structured (CSV) and semi-structured (JSON) data of different sizes, including files of up to 7 GB. The purpose of the work is to determine the optimal Data Lakehouse architecture for building analytical data systems, based on the type and volume of the data sources, the data loading performance with Apache Spark, and the size of the data on disk. The research covers the development of four sequential ETL processes, which include reading, transforming, and loading data into tables in each of the Data Lakehouse systems. The efficiency of each Lakehouse was evaluated according to two key…
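The abstract names loading performance and on-disk size as the quantities of interest. As a minimal sketch of how such a measurement could be taken (not the authors' harness), the Python helper below times an arbitrary load step and then sums the file sizes under the table directory. The names `measure_load`, `directory_size_bytes`, and `fake_load` are hypothetical, and `fake_load` is only a stand-in for a real Spark write.

```python
import os
import time
from pathlib import Path
from typing import Callable, Tuple


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all files under root, including any metadata files."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def measure_load(load: Callable[[str], None], target_dir: str) -> Tuple[float, int]:
    """Run load(target_dir) and return (elapsed seconds, bytes on disk afterwards)."""
    start = time.perf_counter()
    load(target_dir)
    elapsed = time.perf_counter() - start
    return elapsed, directory_size_bytes(target_dir)


def fake_load(target_dir: str) -> None:
    """Hypothetical stand-in for a Spark write: creates one 1024-byte file."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    (Path(target_dir) / "part-0000.bin").write_bytes(b"\0" * 1024)
```

With PySpark, the `load` argument could be something like `lambda path: df.write.format("delta").mode("overwrite").save(path)`, assuming a DataFrame `df` and a Spark session configured with the relevant Lakehouse package. Timing a single run is noisy, so repeated runs would give a fairer comparison.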
