Efficiently Reproducing Distributed Workflows in Notebook-based Systems
Talha Azaz, Raza Ahmad, Md Saiful Islam, Douglas Thain, Tanu Malik

TL;DR
NBRewind is a system that enhances reproducibility and efficiency in executing distributed workflows within notebooks by enabling incremental checkpointing and partial re-execution.
Contribution
The paper introduces a novel kernel system with audit and repeat kernels that perform incremental checkpointing and partial re-execution for distributed workflows in notebooks.
Findings
Incremental checkpoints add minimal overhead.
Checkpoints and logs improve sharing and reproducibility.
NBRewind enables cross-site reproducibility on HPC systems.
Abstract
Notebooks provide an author-friendly environment for iterative development, modular execution, and easy sharing. Distributed workflows are increasingly being authored and executed in notebooks, yet sharing and reproducing them remains challenging. Even small code or parameter changes often force full end-to-end re-execution of the distributed workflow, limiting iterative development for such workloads. Current methods for improving notebook execution operate on single-node workflows, while optimization techniques for distributed workflows typically sacrifice reproducibility. We introduce NBRewind, a notebook kernel system for efficient, reproducible execution of distributed workflows in notebooks. NBRewind consists of two kernels: audit and repeat. The audit kernel performs incremental, cell-level checkpointing to avoid unnecessary re-runs; the repeat kernel reconstructs checkpoints and enables…
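To make cell-level checkpointing with partial re-execution concrete, here is a minimal single-process sketch in Python. It is not NBRewind's implementation: the class name `CellCheckpointer`, the hash-chain key, and the use of `pickle` for state are all assumptions made for illustration. It also stores a full namespace snapshot per cell rather than an incremental delta, and ignores distributed state entirely.

```python
import hashlib
import pickle


class CellCheckpointer:
    """Toy checkpointer (hypothetical): a cell is re-run only if it,
    or any cell before it, has changed since a previous run."""

    def __init__(self):
        # Maps a hash of "all cell sources up to and including this cell"
        # to the pickled namespace produced after running that cell.
        self.checkpoints = {}
        # Indices of cells that were actually executed (not restored).
        self.executed = []

    def run(self, cells):
        namespace = {}
        chain = hashlib.sha256()
        for index, source in enumerate(cells):
            # Extend the hash chain; the separator keeps cell boundaries
            # from being ambiguous when sources are concatenated.
            chain.update(source.encode("utf-8"))
            chain.update(b"\x00")
            key = chain.hexdigest()
            if key in self.checkpoints:
                # Unchanged prefix: restore the saved state, skip execution.
                namespace = pickle.loads(self.checkpoints[key])
            else:
                exec(source, namespace)
                # exec() injects __builtins__, which cannot be pickled.
                namespace.pop("__builtins__", None)
                self.checkpoints[key] = pickle.dumps(namespace)
                self.executed.append(index)
        return namespace
```

Running three cells and then editing only the last one causes just that last cell to execute on the second run; the first two are restored from their checkpoints. A real system would also have to handle unpicklable objects, side effects outside the namespace, and state held by remote workers, none of which this sketch attempts.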
