Smallset Timelines: A Visual Representation of Data Preprocessing Decisions
Lydia R. Lucchesi, Petra M. Kuhnert, Jenny L. Davis, Lexing Xie

TL;DR
The paper introduces the Smallset Timeline, a visualization that helps data scientists document, reflect on, and communicate data preprocessing decisions by showing how a small selection of rows changes at each preprocessing step.
Contribution
The paper presents a visualization method and an R package, smallsets, that creates Smallset Timelines from R and Python preprocessing scripts, with the aim of improving data provenance documentation.
Findings
Visualizes preprocessing decisions through Smallset snapshots
Highlights dataset alterations with color coding
Demonstrates use cases in software defect and income survey data
Abstract
Data preprocessing is a crucial stage in the data analysis pipeline, with both technical and social aspects to consider. Yet, the attention it receives is often lacking in research practice and dissemination. We present the Smallset Timeline, a visualisation to help reflect on and communicate data preprocessing decisions. A "Smallset" is a small selection of rows from the original dataset containing instances of dataset alterations. The Timeline is comprised of Smallset snapshots representing different points in the preprocessing stage and captions to describe the alterations visualised at each point. Edits, additions, and deletions to the dataset are highlighted with colour. We develop the R software package, smallsets, that can create Smallset Timelines from R and Python data preprocessing scripts. Constructing the figure asks practitioners to reflect on and revise decisions as…
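The idea of comparing Smallset snapshots can be illustrated with a minimal sketch. This is not the smallsets package or its API: the function `diff_snapshots`, the row ids, and the example columns (`age`, `income`, `income_k`) are all hypothetical. It only shows how two snapshots of a few rows could be compared to find the deletions, additions, and edits that a Timeline would highlight with colour.

```python
from typing import Any


def diff_snapshots(
    before: dict[int, dict[str, Any]],
    after: dict[int, dict[str, Any]],
) -> dict[str, list]:
    """Compare two snapshots (row id -> {column: value}) of a small row selection.

    Hypothetical helper for illustration only; not part of smallsets.
    """
    # Rows present before the step but missing afterwards were deleted.
    deleted_rows = sorted(rid for rid in before if rid not in after)

    # Columns that appear only in the later snapshot were added.
    before_cols = {col for row in before.values() for col in row}
    after_cols = {col for row in after.values() for col in row}
    added_columns = sorted(after_cols - before_cols)

    # Cells whose value changed in rows and columns that exist in both snapshots.
    edited_cells = sorted(
        (rid, col)
        for rid, row in after.items()
        if rid in before
        for col, value in row.items()
        if col in before[rid] and before[rid][col] != value
    )

    return {
        "deleted_rows": deleted_rows,
        "added_columns": added_columns,
        "edited_cells": edited_cells,
    }


# Made-up example data: three rows before preprocessing.
snapshot_before = {
    1: {"age": 34, "income": 52000},
    2: {"age": None, "income": 61000},
    3: {"age": 29, "income": -1},
}

# After preprocessing: row 3 dropped (invalid income), the missing age in
# row 2 filled in, and a derived column added.
snapshot_after = {
    1: {"age": 34, "income": 52000, "income_k": 52.0},
    2: {"age": 30, "income": 61000, "income_k": 61.0},
}

changes = diff_snapshots(snapshot_before, snapshot_after)
print(changes)
```

Each of the three categories in the result corresponds to one kind of alteration named in the abstract; a figure would map each category to its own colour and pair each snapshot with a caption describing the decision behind it.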
Taxonomy
Topics: Data Analysis with R · Scientific Computing and Data Management · Data Visualization and Analytics
