Create Benchmarks for Data Lakes

Yi Lyu, Pei-Chieh Lo, and Natan Lidukhover

arXiv:2601.19176 · cs.DB · January 28, 2026

Open Access

TL;DR

This paper introduces a comprehensive benchmarking framework for data lakes, addressing the lack of standardized evaluation tools by covering diverse data types and workloads, and enabling fair comparison of different systems.

Contribution

The paper presents a new extensible and reproducible benchmark specifically designed for data lakes, including multiple data types and workload models not covered by existing benchmarks.

Findings

01. Benchmark effectively compares commercial and open-source data lakes
02. Performance metrics include query time and metadata handling
03. Demonstrates applicability on CloudLab platform

Abstract

Data lakes have emerged as a flexible and scalable solution for storing and analyzing large volumes of heterogeneous data, including structured, semi-structured, and unstructured formats. Despite their growing adoption in both industry and academia, there is a lack of standardized and comprehensive benchmarks for evaluating the performance of data lake systems. Existing benchmarks primarily target traditional data warehouses and focus on structured SQL workloads, making them insufficient for capturing the diverse workloads and access patterns typical of data lakes. In this work, we propose a new benchmarking framework for data lakes that aims to provide an objective and comparative evaluation of different data lake implementations. Our benchmark covers multiple data types and workload models, including data retrieval, aggregation, querying, and similarity search, which is a common yet…

Taxonomy

Topics: Data Quality and Management · Advanced Database Systems and Queries · Research Data Management Practices