A System for Quantifying Data Science Workflows with Fine-Grained Procedural Logging and a Pilot Study
Jinjin Zhao, Avigdor Gal, Sanjay Krishnan

TL;DR
This paper introduces DataInquirer, a system for detailed, automated logging of data science workflows in Jupyter notebooks, enabling analysis of programming patterns, timing, and variability among data scientists without manual annotation.
Contribution
The paper presents a novel system for fine-grained, automated tracking of data science activities and demonstrates its use in analyzing workflow variability and the influence of AI tools on data analysis workflows.
Findings
Significant differences in conclusions among data scientists analyzing the same data.
AI-powered code tools influence how closely a data scientist's workflow resembles an expert's.
Quantitative measurement of data science workflow variability.
Abstract
It is important for researchers to understand precisely how data scientists turn raw data into insights, including typical programming patterns, workflow, and methodology. This paper contributes a novel system, called DataInquirer, that tracks incremental code executions in Jupyter notebooks (a type of computational notebook). The system allows us to quantitatively measure timing, workflow, and operation frequency in data science tasks without resorting to human annotation or interview. In a series of pilot studies, we collect 97 traces, logging data scientist activities across four studies. While this paper presents a general system and data analysis approach, we focus on a foundational sub-question in our pilot studies: How consistent are different data scientists in analyzing the same data? We taxonomize variation between data scientists on the same dataset according to three…
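As a rough illustration of the kind of measurement the abstract describes (operation frequency and timing derived from logged cell executions), the sketch below counts function and method calls in a small hand-written trace. It is not DataInquirer's implementation: the `CellExecution` record, its fields, and the `called_names` and `summarize` helpers are hypothetical names chosen for this example, and the trace contents are made up.

```python
import ast
from collections import Counter
from dataclasses import dataclass


@dataclass
class CellExecution:
    """One logged cell run (hypothetical record shape, not DataInquirer's schema)."""
    source: str        # code that was executed in the cell
    started_at: float  # seconds since the session began
    ended_at: float    # seconds since the session began


def called_names(source: str) -> list[str]:
    """Return the name of every function or method called in `source`.

    The code is only parsed, never executed, so undefined names are fine.
    """
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):   # e.g. df.dropna()
                names.append(func.attr)
            elif isinstance(func, ast.Name):      # e.g. len(df)
                names.append(func.id)
    return names


def summarize(trace: list[CellExecution]) -> tuple[Counter, float]:
    """Aggregate call frequencies and total execution time over a trace."""
    op_counts: Counter = Counter()
    busy_seconds = 0.0
    for cell in trace:
        op_counts.update(called_names(cell.source))
        busy_seconds += cell.ended_at - cell.started_at
    return op_counts, busy_seconds


# Made-up trace standing in for one participant's logged session.
trace = [
    CellExecution("df = pd.read_csv('data.csv')", 0.0, 1.5),
    CellExecution("df = df.dropna()\ndf.describe()", 4.0, 4.5),
    CellExecution("df.groupby('k').mean()", 10.0, 11.0),
]
op_counts, busy_seconds = summarize(trace)
print(op_counts.most_common())
print(busy_seconds)
```

Comparing such per-trace summaries across participants is one simple way to quantify how much workflows differ on the same dataset. The gaps between cells (here, 1.5 to 4.0 seconds and 4.5 to 10.0 seconds) could likewise be read as time spent between executions.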
Taxonomy
Topics: Scientific Computing and Data Management · Big Data and Business Intelligence
