TL;DR
This paper introduces an MDL-based formulation of subgroup set discovery for numeric targets, enabling the extraction of non-redundant, interpretable subgroup lists whose subgroups deviate strongly from the overall target distribution and have low spread.
Contribution
It formalizes a dispersion-aware MDL approach for subgroup set discovery and proposes SSD++, a heuristic algorithm that finds high-quality, non-redundant subgroup lists.
Findings
SSD++ produces compact, non-redundant subgroup lists
The method effectively balances subgroup quality and complexity
Empirical results show superior performance over existing methods
Abstract
The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations though, as they typically heavily rely on heuristics and/or user-chosen hyperparameters. We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that, in addition, it allows to…
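To make the idea of "dispersion-aware" compression concrete, the following is a minimal sketch and not the paper's actual encoding. It assumes, purely for illustration, that target values are encoded with plug-in maximum-likelihood Gaussians and it ignores the cost of encoding the model itself. The function names (`gaussian_code_length`, `mean_var`, `compression_gain`) are hypothetical. The sketch compares how many bits a subgroup's target values cost under the dataset-wide distribution versus under the subgroup's own distribution; a subgroup with a strongly deviating mean and low spread yields a larger saving.

```python
import math


def gaussian_code_length(values, mean, var):
    """Code length in bits of `values` under a Gaussian density N(mean, var).

    Assumption: a plug-in Gaussian code, chosen here for illustration only.
    """
    n = len(values)
    squared_error = sum((v - mean) ** 2 for v in values)
    nats = 0.5 * n * math.log(2 * math.pi * var) + squared_error / (2 * var)
    return nats / math.log(2)  # convert nats to bits


def mean_var(values):
    """Maximum-likelihood mean and variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def compression_gain(subgroup, dataset):
    """Bits saved by encoding the subgroup with its own Gaussian
    instead of the dataset-wide Gaussian (model cost is ignored)."""
    d_mean, d_var = mean_var(dataset)
    s_mean, s_var = mean_var(subgroup)
    default_bits = gaussian_code_length(subgroup, d_mean, d_var)
    local_bits = gaussian_code_length(subgroup, s_mean, s_var)
    return default_bits - local_bits


if __name__ == "__main__":
    dataset = [1, 2, 3, 4, 5, 20, 21, 22, 23, 24]
    tight = [20, 21, 22, 23, 24]   # deviating mean, low spread
    spread = [1, 5, 20, 24, 22]    # mean near the dataset mean, high spread
    print("tight subgroup gain (bits): ", compression_gain(tight, dataset))
    print("spread subgroup gain (bits):", compression_gain(spread, dataset))
```

In this toy setting the gain is never negative, because the subgroup's own maximum-likelihood parameters always fit its values at least as well as the dataset-wide ones. A full MDL formulation would also charge for describing the subgroup and its parameters, which is what penalizes overly complex or tiny subgroups.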
Taxonomy
Methods: Minimum Description Length · Symbolic rule learning · SSD
