# An Alternative to Cells for Selective Execution of Data Science Pipelines

**Authors:** Lars Reimann, Günter Kniesel-Wünsche

arXiv: 2302.14556 · 2023-04-10

## TL;DR

This paper proposes replacing code cells in notebooks with variable-based actions driven by data-flow analysis, improving the management and execution of data science pipelines by reducing errors and clutter.

## Contribution

It introduces a novel approach that uses variable context menus and data-flow analysis to replace code cells for selective execution in data science notebooks.
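The idea of offering actions that fit a variable's type can be illustrated with a small registry. This is only a sketch under stated assumptions, not the authors' implementation: the `Table` class, the `action` decorator, the `menu_for` function, and the action labels are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Table:
    """Hypothetical stand-in for a tabular dataset: column name -> values."""
    columns: dict


# Hypothetical registry: type -> {menu label -> action function}.
ACTIONS: dict = {}


def action(for_type: type, label: str) -> Callable:
    """Register a function as a context-menu action for values of `for_type`."""
    def register(func: Callable) -> Callable:
        ACTIONS.setdefault(for_type, {})[label] = func
        return func
    return register


@action(Table, "List columns")
def list_columns(table: Table) -> list:
    return list(table.columns)


@action(Table, "Count rows")
def count_rows(table: Table) -> int:
    return max((len(values) for values in table.columns.values()), default=0)


@action(list, "Show length")
def show_length(values: list) -> int:
    return len(values)


def menu_for(value: Any) -> dict:
    """Collect the actions registered for the value's type and its base classes."""
    menu = {}
    for cls in reversed(type(value).__mro__):
        menu.update(ACTIONS.get(cls, {}))
    return menu


people = Table({"name": ["Ada", "Linus"], "age": [36, 54]})
for label, func in menu_for(people).items():
    print(label, "->", func(people))
```

Because the decision-making actions live in the menu rather than in the pipeline source, inspecting `people` leaves no extra code behind in the pipeline itself.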

## Key findings

- Reduces errors caused by incorrect cell execution order
- Separates pipeline code from decision-making code
- Automates dependency management for data science workflows

## Abstract

Data Scientists often use notebooks to develop Data Science (DS) pipelines, particularly since notebooks allow users to selectively execute parts of the pipeline. However, notebooks for DS have many well-known flaws. We focus on the following ones in this paper: (1) Notebooks can become littered with code cells that are not part of the main DS pipeline but exist solely to make decisions (e.g. listing the columns of a tabular dataset). (2) While users are allowed to execute cells in any order, not every ordering is correct, because a cell can depend on declarations from other cells. (3) After making changes to a cell, this cell and all cells that depend on changed declarations must be rerun. (4) Changes to external values necessitate partial re-execution of the notebook. (5) Since cells are the smallest unit of execution, code that is unaffected by changes can inadvertently be re-executed.

To solve these issues, we propose to replace cells as the basis for the selective execution of DS pipelines. Instead, we suggest populating a context menu for variables with actions fitting their type (like listing columns if the variable is a tabular dataset). These actions are executed based on a data-flow analysis to ensure dependencies between variables are respected and results are updated properly after changes. Our solution separates pipeline code from decision-making code and automates dependency management, thus reducing clutter and the risk of making errors.
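How a data-flow analysis can drive selective execution may be easier to see in a small example. The sketch below is an assumption-laden illustration, not the analysis described in the paper: it handles only straight-line Python with plain assignments (no augmented assignments, loops, or function definitions), and the names `analyze`, `backward_slice`, `affected_by`, `run`, and `PIPELINE` are hypothetical. It uses Python's standard `ast` module to find which variables each statement defines and uses.

```python
import ast

# Hypothetical pipeline used only for this illustration.
PIPELINE = """\
raw = [3, 1, 2]
cleaned = sorted(raw)
limit = 2
top = cleaned[:limit]
total = sum(raw)
"""


def analyze(source: str) -> list:
    """Return (statement_source, defined_names, used_names) per top-level statement."""
    result = []
    for stmt in ast.parse(source).body:
        defs, uses = set(), set()
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name):
                (defs if isinstance(node.ctx, ast.Store) else uses).add(node.id)
        result.append((ast.get_source_segment(source, stmt), defs, uses))
    return result


def backward_slice(statements: list, target: str) -> list:
    """Indices of the statements needed to compute `target`, in program order."""
    needed, selected = {target}, []
    for i in range(len(statements) - 1, -1, -1):
        _, defs, uses = statements[i]
        if defs & needed:
            selected.append(i)
            needed = (needed - defs) | uses
    return sorted(selected)


def affected_by(statements: list, changed_index: int) -> list:
    """Indices of the statements to rerun after statement `changed_index` is edited."""
    dirty = set(statements[changed_index][1])
    rerun = [changed_index]
    for i in range(changed_index + 1, len(statements)):
        _, defs, uses = statements[i]
        if uses & dirty:
            rerun.append(i)
            dirty |= defs
        else:
            dirty -= defs  # redefined from unchanged inputs, so clean again
    return rerun


def run(statements: list, indices: list, env: dict = None) -> dict:
    """Execute only the selected statements, in order, in a shared namespace."""
    env = {} if env is None else env
    for i in indices:
        exec(statements[i][0], env)
    return env


statements = analyze(PIPELINE)
env = run(statements, backward_slice(statements, "top"))
print(env["top"])                  # computed without running the `total` statement
print(affected_by(statements, 2))  # statements to rerun after editing `limit`
```

Requesting `top` runs only the statements it depends on, which corresponds to flaws (2) and (5) above; editing `limit` marks only its dependents for re-execution, which corresponds to flaw (3).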

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.14556/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/2302.14556/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/2302.14556/full.md

---
Source: https://tomesphere.com/paper/2302.14556