Modeling Data Analytic Iteration With Probabilistic Outcome Sets
Roger D. Peng, Stephanie C. Hicks

TL;DR
This paper introduces a formal probabilistic model of iterative data analysis. The model centers on how analysts compare their expectations with what they observe in order to make decisions, with the aim of providing clearer guidance for exploratory data analysis.
Contribution
The paper proposes a model, based on expected and anomaly probabilistic outcome sets and statistical information gain, that formalizes decision-making in data analysis and extends traditional approaches.
Findings
Framework characterizes common data analysis situations.
Defines criteria for expected and anomaly information gain.
Guides iterative decision-making in exploratory analysis.
Abstract
In 1977 John Tukey described how in exploratory data analysis, data analysts use tools, such as data visualizations, to separate their expectations from what they observe. In contrast to statistical theory, an underappreciated aspect of data analysis is that a data analyst must make decisions by comparing the observed data or output from a statistical tool to what the analyst previously expected from the data. However, there is little formal guidance for how to make these data analytic decisions as statistical theory generally omits a discussion of who is using these statistical methods. In this paper, we propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets, and the concept of statistical information gain. Here, we extend the basic idea of comparing an analyst's…
Taxonomy
Topics: Multi-Criteria Decision Making
