Efficiently Discovering Locally Exceptional yet Globally Representative Subgroups
Janis Kalofolias, Mario Boley, Jilles Vreeken

TL;DR
This paper introduces an efficient method for discovering subgroups that are both exceptional with respect to a target variable and representative of the global distribution with respect to a control variable, improving both the interpretability of the results and the speed of discovery.
Contribution
It formalizes an objective for finding globally representative subgroups and provides an efficient algorithm to identify top-k such subgroups using branch-and-bound.
Findings
Discoveries are meaningful and representative of the global distribution.
Algorithm is up to orders of magnitude faster than previous methods.
Effective on a wide range of datasets.
Abstract
Subgroup discovery is a local pattern mining technique to find interpretable descriptions of sub-populations that stand out on a given target variable. That is, these sub-populations are exceptional with regard to the global distribution. In this paper we argue that in many applications, such as scientific discovery, subgroups are only useful if they are additionally representative of the global distribution with regard to a control variable. That is, when the distribution of this control variable is the same, or almost the same, as over the whole data. We formalise this objective function and give an efficient algorithm to compute its tight optimistic estimator for the case of a numeric target and a binary control variable. This enables us to use the branch-and-bound framework to efficiently discover the top-k subgroups that are both exceptional as well as representative.…
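To make the idea in the abstract concrete, the following is a minimal sketch of how a subgroup could be scored for being exceptional on a numeric target while staying representative on a binary control variable. The exact objective used in the paper is not given in this summary, so the product form below (coverage times positive mean shift times a representativeness factor based on total variation distance) is an assumption chosen for illustration, as are the names `representative_score` and `gamma`.

```python
from typing import Sequence


def representative_score(
    target: Sequence[float],
    control: Sequence[int],
    members: Sequence[bool],
    gamma: float = 1.0,
) -> float:
    """Illustrative score (an assumption, not the paper's formula).

    target  : numeric target value per record
    control : binary control value (0 or 1) per record
    members : True where the record belongs to the subgroup
    gamma   : hypothetical weight on the representativeness factor
    """
    n = len(target)
    idx = [i for i in range(n) if members[i]]
    if not idx:
        return 0.0

    # Relative size of the subgroup.
    coverage = len(idx) / n

    # Exceptionality: positive shift of the subgroup mean over the global mean.
    global_mean = sum(target) / n
    sub_mean = sum(target[i] for i in idx) / len(idx)
    shift = max(sub_mean - global_mean, 0.0)

    # Representativeness: for a binary variable, the total variation distance
    # between two distributions is the absolute difference of P(control = 1).
    p_global = sum(control) / n
    p_sub = sum(control[i] for i in idx) / len(idx)
    tv_distance = abs(p_sub - p_global)

    return coverage * shift * (1.0 - tv_distance) ** gamma


# Toy data: high target values in the second half, control balanced globally.
target = [1, 2, 3, 4, 10, 12, 11, 13]
control = [0, 1, 0, 1, 0, 1, 0, 1]

# Subgroup A covers records 4..7 and mirrors the global control distribution.
group_a = [i >= 4 for i in range(8)]
# Subgroup B covers records 5 and 7 only, so every member has control = 1.
group_b = [i in (5, 7) for i in range(8)]

score_a = representative_score(target, control, group_a)
score_b = representative_score(target, control, group_b)
print(score_a, score_b)
```

In this toy data, subgroup B has the larger mean shift but is penalised twice: it is smaller, and its control distribution deviates from the global one. Under the assumed scoring it therefore ranks below subgroup A, which is the behaviour the abstract argues for.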
Taxonomy
Topics: Data Mining Algorithms and Applications · Imbalanced Data Classification Techniques · Algorithms and Data Compression
