Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie
Jacopo Tagliabue, Ciro Greco

TL;DR
This paper presents a system that enhances reproducibility of data pipelines over data lakes by decoupling compute from data management, enabling time-travel, branching, and easy pipeline reproduction using Bauplan and Nessie.
Contribution
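The paper introduces a system that combines Bauplan's cloud runtime with Nessie, an open-source catalog with Git semantics, to make data pipelines reproducible and replayable with a few CLI commands. It targets two obstacles in Lakehouse architectures: slow testing caused by pipeline size, and the intertwining of business logic with data management.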
Findings
Supports time-travel and branching semantics on object storage
Enables full pipeline reproducibility with simple CLI commands
Decouples compute from data management for faster testing
Abstract
As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. Demonstrating the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands.
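To make the branching and time-travel semantics mentioned in the abstract concrete, the following is a minimal in-memory sketch of a catalog with Git-like behavior. It is a toy model written for illustration only: it is not the Nessie or Bauplan API, and every name in it (ToyCatalog, create_branch, commit, read, the s3:// paths) is a hypothetical assumption. The idea it shows is that each commit maps table names to immutable snapshot locations, so a branch is just a movable pointer to a commit and an old commit can always be read again.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Commit:
    id: int
    parent: Optional[int]
    tables: Dict[str, str]  # table name -> snapshot location (hypothetical paths)


class ToyCatalog:
    """Toy Git-like catalog; an illustration, not the Nessie API."""

    def __init__(self) -> None:
        self.commits: Dict[int, Commit] = {0: Commit(0, None, {})}
        self.branches: Dict[str, int] = {"main": 0}

    def resolve(self, ref: Union[str, int]) -> int:
        # A ref is either a branch name or a commit id.
        return ref if isinstance(ref, int) else self.branches[ref]

    def create_branch(self, name: str, from_ref: Union[str, int] = "main") -> None:
        # Branching copies a pointer, never the data itself.
        self.branches[name] = self.resolve(from_ref)

    def commit(self, branch: str, table: str, snapshot: str) -> int:
        parent = self.branches[branch]
        tables = dict(self.commits[parent].tables)  # copy, so old commits stay intact
        tables[table] = snapshot
        new_id = len(self.commits)
        self.commits[new_id] = Commit(new_id, parent, tables)
        self.branches[branch] = new_id
        return new_id

    def read(self, ref: Union[str, int], table: str) -> str:
        return self.commits[self.resolve(ref)].tables[table]


catalog = ToyCatalog()
c1 = catalog.commit("main", "orders", "s3://bucket/orders/v1.parquet")

# Work on an isolated branch without touching main.
catalog.create_branch("dev")
catalog.commit("dev", "orders", "s3://bucket/orders/v2.parquet")

# Main moves on; the earlier state stays readable through its commit id.
catalog.commit("main", "orders", "s3://bucket/orders/v3.parquet")
old_orders = catalog.read(c1, "orders")
```

In this sketch, re-running a pipeline against a past state amounts to resolving a commit id to the exact snapshots it recorded. A real catalog would also need persistence, merges, and conflict handling, which are left out here.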
Taxonomy
Topics: Scientific Computing and Data Management · Data Visualization and Analytics · Big Data Technologies and Applications
