Comparing Apples and Oranges: Measuring Differences between Exploratory Data Mining Results
Nikolaj Tatti, Jilles Vreeken

TL;DR
This paper introduces an information-theoretic measure for comparing exploratory data mining results on binary data, so that analysts can assess how much two results differ and decide what to mine next during iterative analysis.
Contribution
It proposes a flexible, interpretable measure based on Maximum Entropy and Kullback-Leibler divergence that incorporates background knowledge and generalizes existing dissimilarity metrics.
Findings
The measure distinguishes between the results of different mining methods.
It can identify parts of results that redescribe others.
It supports iterative data mining for maximal novel insights.
Abstract
Deciding whether the results of two different mining algorithms provide significantly different information is an important, yet understudied, open problem in exploratory data mining. Whether the goal is to select the most informative result for analysis, or to decide which mining approach will most likely provide the most novel insight, it is essential that we can tell how different the information is that different results by possibly different methods provide. In this paper we take a first step towards comparing exploratory data mining results on binary data. We propose to meaningfully convert results into sets of noisy tiles, and compare between these sets by Maximum Entropy modelling and Kullback-Leibler divergence, well-founded notions from Information Theory. We so construct a measure that is highly flexible, and allows us to naturally include background knowledge, such that…
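The pipeline the abstract describes (results as noisy tiles, a Maximum Entropy model per tile set, Kullback-Leibler divergence between models) can be illustrated with a small sketch. Everything below is an assumption made for illustration and not the authors' implementation: the tile representation `(rows, cols, density)`, the helper names `tile`, `fit_maxent`, `kl` and `distance`, the clamping constant `eps`, the bisection-based iterative scaling, and the particular normalisation used in `distance`. The sketch relies on the standard fact that a Maximum Entropy model constrained only by expected tile densities factorises into independent Bernoulli cells.

```python
import math

def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tile(data, rows, cols):
    # A "noisy tile": a set of rows and columns plus the fraction of 1s inside it.
    cells = [data[i][j] for i in rows for j in cols]
    return (list(rows), list(cols), sum(cells) / len(cells))

def fit_maxent(shape, tiles, sweeps=50, eps=1e-6):
    # Iterative scaling: repeatedly shift the logits of each tile's cells
    # until the expected density of the tile matches its observed density.
    n, m = shape
    logit = [[0.0] * m for _ in range(n)]  # no constraints: every cell is 0.5
    for _ in range(sweeps):
        for rows, cols, dens in tiles:
            target = min(max(dens, eps), 1.0 - eps)  # avoid infinite logits
            cells = [(i, j) for i in rows for j in cols]
            lo, hi = -50.0, 50.0
            for _ in range(60):  # bisection on the shift (mean is monotone in it)
                mid = (lo + hi) / 2.0
                mean = sum(_sigmoid(logit[i][j] + mid) for i, j in cells) / len(cells)
                if mean < target:
                    lo = mid
                else:
                    hi = mid
            shift = (lo + hi) / 2.0
            for i, j in cells:
                logit[i][j] += shift
    return [[_sigmoid(x) for x in row] for row in logit]

def kl(p, q):
    # KL divergence between two products of independent Bernoulli cells.
    total = 0.0
    for prow, qrow in zip(p, q):
        for a, b in zip(prow, qrow):
            total += a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    return total

def distance(shape, T, U):
    # Assumed normalisation: how much the combined model knows beyond each
    # tile set alone, relative to what each set knows beyond the empty model.
    p_t, p_u = fit_maxent(shape, T), fit_maxent(shape, U)
    p_tu, p_0 = fit_maxent(shape, T + U), fit_maxent(shape, [])
    num = kl(p_tu, p_t) + kl(p_tu, p_u)
    den = kl(p_t, p_0) + kl(p_u, p_0)
    return num / den if den > 0 else 0.0

# Toy binary dataset and two single-tile "results" (hypothetical example data).
data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
T = [tile(data, [0, 1], [0, 1])]  # dense block, top left
U = [tile(data, [2, 3], [2, 3])]  # noisier block, bottom right

same = distance((4, 4), T, T)      # identical results: close to 0
disjoint = distance((4, 4), T, U)  # results about disjoint cells: close to 1
```

In this toy setting, two results that describe the same tile are at distance roughly 0, while results that describe disjoint parts of the data share no information and land at roughly 1. Background knowledge could be modelled in the same sketch by adding the same extra tiles to every tile set before fitting, though how the paper does this is not visible in the truncated abstract.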
