Create Benchmarks for Data Lakes
Yi Lyu, Pei-Chieh Lo, and Natan Lidukhover

TL;DR
This paper introduces a comprehensive benchmarking framework for data lakes, addressing the lack of standardized evaluation tools by covering diverse data types and workloads, and enabling fair comparison of different systems.
Contribution
The paper presents a new extensible and reproducible benchmark specifically designed for data lakes, including multiple data types and workload models not covered by existing benchmarks.
Findings
The benchmark is used to compare commercial and open-source data lakes
Performance metrics include query time and metadata handling
Applicability is demonstrated on the CloudLab platform
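As a rough illustration of the query-time metric listed above, the sketch below times four toy workloads (retrieval, filtering query, aggregation, similarity search) against an in-memory list of records. It is not the paper's harness: the `records` dataset, the workload functions, and `time_workload` are all hypothetical names chosen for this example, and a real data lake benchmark would issue these operations against an actual storage and query engine.

```python
import time
from statistics import median

def time_workload(fn, repeats=5):
    """Run fn `repeats` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)

# Hypothetical toy dataset standing in for data lake contents.
records = [
    {"id": i, "value": i % 10, "vec": (float(i), float(i % 7))}
    for i in range(1000)
]

def retrieval():
    # Point lookup of a single record by id.
    return next(r for r in records if r["id"] == 500)

def filter_query():
    # Simple predicate query over one attribute.
    return [r for r in records if r["value"] == 3]

def aggregation():
    # Sum of one attribute over all records.
    return sum(r["value"] for r in records)

def similarity_search(query=(3.0, 3.0), k=5):
    # Brute-force k-nearest neighbours by squared Euclidean distance.
    def dist2(v):
        return (v[0] - query[0]) ** 2 + (v[1] - query[1]) ** 2
    return sorted(records, key=lambda r: dist2(r["vec"]))[:k]

workloads = {
    "retrieval": retrieval,
    "filter_query": filter_query,
    "aggregation": aggregation,
    "similarity_search": similarity_search,
}

# Median query time per workload, in seconds.
results = {name: time_workload(fn) for name, fn in workloads.items()}

for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s")
```

Taking the median over several repeats is one simple way to reduce the effect of outlier runs; the choice of statistic and repeat count here is an assumption, not something the summary specifies.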
Abstract
Data lakes have emerged as a flexible and scalable solution for storing and analyzing large volumes of heterogeneous data, including structured, semi-structured, and unstructured formats. Despite their growing adoption in both industry and academia, there is a lack of standardized and comprehensive benchmarks for evaluating the performance of data lake systems. Existing benchmarks primarily target traditional data warehouses and focus on structured SQL workloads, making them insufficient for capturing the diverse workloads and access patterns typical of data lakes. In this work, we propose a new benchmarking framework for data lakes that aims to provide an objective and comparative evaluation of different data lake implementations. Our benchmark covers multiple data types and workload models, including data retrieval, aggregation, querying, and similarity search, which is a common yet…
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Research Data Management Practices
