Beyond Reproducible Research: Building a Formal Representation of a Data Analysis
Roger D. Peng

TL;DR
This paper proposes creating a formal, logical representation of data analyses to improve understanding, evaluation, and reproducibility beyond traditional code sharing methods.
Contribution
It introduces a formal representation framework for data analysis that captures the analyst's logical reasoning, so that an analysis can be evaluated without the data and its assumptions can be visualized.
Findings
Formal representation captures analysis reasoning.
Allows evaluation of some aspects of an analysis without data access.
Facilitates understanding of assumptions.
Abstract
Data analyses are often constructed in an imperative manner, where commands representing actions taken on the data are issued sequentially. The publication of these commands, along with the data, is essential to the reproducibility of the analysis by others. However, simply presenting the code and the results of running the code can hide important details about the data analyst's premises, expectations, and assumptions about the data. Understanding this analysis reasoning can be critical to evaluating the quality of an analysis and for suggesting possible improvements. We argue that a formal representation of a data analysis that externalizes its logical construction offers more useful information for statically illustrating an analyst's reasoning. Such a formal representation would allow for the evaluation of some aspects of a data analysis without the need for the data, the…
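As a rough illustration of what it could mean to evaluate an analysis without the data, the sketch below records each analysis step together with the assumptions it relies on and the properties it establishes, then checks statically whether every assumption is either declared by the analyst or established by an earlier step. This is not the representation proposed in the paper; the `Step` class, the `unmet_assumptions` function, and all step and assumption names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One analysis step (hypothetical structure, not the paper's formalism)."""
    name: str
    requires: frozenset  # assumptions about the data this step relies on
    provides: frozenset  # properties that hold once the step has run


def unmet_assumptions(steps, declared):
    """Return (step name, assumption) pairs that nothing justifies.

    The check walks the steps in order and never touches any data: an
    assumption is satisfied if the analyst declared it up front or an
    earlier step provides it.
    """
    known = set(declared)
    problems = []
    for step in steps:
        for missing in sorted(step.requires - known):
            problems.append((step.name, missing))
        known |= step.provides
    return problems


# Illustrative analysis: the step and assumption names are made up.
analysis = [
    Step("drop_missing", frozenset(), frozenset({"no_missing_values"})),
    Step("log_transform", frozenset({"outcome_positive"}),
         frozenset({"outcome_log_scale"})),
    Step("fit_linear_model",
         frozenset({"no_missing_values", "outcome_log_scale"}),
         frozenset({"model_fitted"})),
]

# With the positivity assumption stated, every step is justified.
print(unmet_assumptions(analysis, declared={"outcome_positive"}))

# Without it, the log transform rests on an unstated assumption.
print(unmet_assumptions(analysis, declared=set()))
```

In this toy setup, a reviewer could see that the log transform depends on an unstated positivity assumption just by reading the declared structure, which is the kind of static inspection the abstract describes.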
Taxonomy
Topics: Logic, programming, and type systems · Scientific Computing and Data Management · Advanced Database Systems and Queries
