LakeMLB: Data Lake Machine Learning Benchmark
Feiyu Pan, Tianbin Zhang, Aoqian Zhang, Yu Sun, Zheng Wang, Lixing Chen, Li Pan, Jianhua Li

TL;DR
LakeMLB is a comprehensive benchmark designed to evaluate machine learning performance in data lake environments, focusing on multi-table scenarios with diverse real-world datasets and integration strategies.
Contribution
LakeMLB introduces standardized multi-table benchmarks for data lakes, covering the Union and Join scenarios and providing datasets and code for rigorous ML research in this domain.
Findings
Performance varies significantly across integration strategies.
State-of-the-art methods show strengths and weaknesses in complex data lake scenarios.
The benchmark facilitates systematic evaluation of ML approaches in data lakes.
Abstract
Modern data lakes have emerged as foundational platforms for large-scale machine learning, enabling flexible storage of heterogeneous data and structured analytics through table-oriented abstractions. Despite their growing importance, standardized benchmarks for evaluating machine learning performance in data lake environments remain scarce. To address this gap, we present LakeMLB (Data Lake Machine Learning Benchmark), designed for the most common multi-source, multi-table scenarios in data lakes. LakeMLB focuses on two representative multi-table scenarios, Union and Join, and provides three real-world datasets for each scenario, covering government open data, finance, Wikipedia, and online marketplaces. The benchmark supports three representative integration strategies: pre-training-based, data augmentation-based, and feature augmentation-based approaches. We conduct extensive…
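To make the two scenarios named in the abstract concrete, here is a minimal sketch of what Union and Join mean at the table level. It is not the LakeMLB code. The tables, column names, and helper functions (`union_tables`, `left_join`) are hypothetical and exist only for illustration. Plain Python lists of dicts stand in for tables.

```python
# Hypothetical toy tables; column names and values are illustrative only
# and are not taken from the LakeMLB datasets.
task_table = [
    {"id": 1, "price": 10.0, "label": 0},
    {"id": 2, "price": 25.0, "label": 1},
]

# Union scenario: an auxiliary table with a compatible schema contributes extra rows.
union_aux = [
    {"id": 3, "price": 18.0, "label": 1},
]

# Join scenario: an auxiliary table shares a key and contributes extra columns.
join_aux = [
    {"id": 1, "seller_rating": 4.5},
    {"id": 2, "seller_rating": 3.9},
]


def union_tables(base, aux):
    """Append the rows of aux to base, keeping only the columns present in base."""
    columns = list(base[0].keys())
    aligned = [{c: row.get(c) for c in columns} for row in aux]
    return [dict(row) for row in base] + aligned


def left_join(base, aux, key):
    """Left-join aux onto base on `key`; unmatched base rows keep only their own columns."""
    lookup = {row[key]: row for row in aux}
    joined = []
    for row in base:
        merged = dict(row)
        extra = lookup.get(row[key], {})
        merged.update({k: v for k, v in extra.items() if k != key})
        joined.append(merged)
    return joined


unioned = union_tables(task_table, union_aux)   # more rows, same columns
joined = left_join(task_table, join_aux, "id")  # same rows, more columns
```

In this sketch, a Union grows the table vertically (more rows under the same schema), while a Join grows it horizontally (more columns per row). How the benchmark's three integration strategies actually consume such tables is described in the paper and is not modeled here.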
Taxonomy
Topics: Data Quality and Management · Big Data and Business Intelligence · Machine Learning and Data Classification
