Finding Statistically Significant Attribute Interactions
Andreas Henelius, Antti Ukkonen, Kai Puolamäki

TL;DR
This paper introduces a statistical significance testing method and an algorithm called ASTRID to identify and analyze attribute interactions specific to a variable of interest in data, aiding in understanding data structure and feature selection.
Contribution
The paper presents a novel statistical approach and the ASTRID algorithm for automatically discovering attribute partitions that explain data generation processes, enhancing data analysis capabilities.
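Read as a formula, a partition that "explains the data generation process" means the class-conditional distribution factorises over the groups of the partition. The notation below (attributes $x$, class $c$, partition $\mathcal{P}$ with groups $S$) is assumed here for illustration and is not taken from the paper.

```latex
% Assumed notation: x = (x_1, ..., x_m) are the attributes, c is the class label,
% and \mathcal{P} = \{S_1, ..., S_k\} is a partition of the attribute indices \{1, ..., m\}.
\[
  p(x \mid c) \;=\; \prod_{S \in \mathcal{P}} p(x_S \mid c),
  \qquad x_S = (x_j)_{j \in S}.
\]
% Example with m = 3 and \mathcal{P} = \{\{1, 2\}, \{3\}\}:
\[
  p(x_1, x_2, x_3 \mid c) \;=\; p(x_1, x_2 \mid c)\, p(x_3 \mid c).
\]
```

Under this reading, attributes in the same group may interact given the class, while attributes in different groups are conditionally independent given the class.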
Findings
ASTRID effectively finds attribute partitions in real and synthetic data.
The method identifies significant attribute interactions related to the class variable.
State-of-the-art classifiers help validate the discovered interactions.
Abstract
In many data exploration tasks it is meaningful to identify groups of attribute interactions that are specific to a variable of interest. For instance, in a dataset where the attributes are medical markers and the variable of interest (class variable) is binary indicating presence/absence of disease, we would like to know which medical markers interact with respect to the binary class label. These interactions are useful in several practical applications, for example, to gain insight into the structure of the data, in feature selection, and in data anonymisation. We present a novel method, based on statistical significance testing, that can be used to verify if the data set has been created by a given factorised class-conditional joint distribution, where the distribution is parametrised by a partition of its attributes. Furthermore, we provide a method, named ASTRID, for automatically…
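A minimal sketch of how such a significance test could look, assuming a within-class permutation scheme: each attribute group is shuffled as a unit among rows of the same class, which yields surrogate datasets consistent with the factorisation implied by a given partition. This is not the ASTRID implementation. All function names are hypothetical, and the test statistic (a within-class covariance between two binary attributes) is a simple stand-in chosen to keep the example self-contained.

```python
import random


def permute_within_class(rows, labels, partition, rng):
    """Shuffle each attribute group, as a unit, among rows of the same class."""
    out = [list(r) for r in rows]
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    for idx in by_class.values():
        for group in partition:
            src = idx[:]
            rng.shuffle(src)  # one shuffle per group keeps its columns together
            for dst, s in zip(idx, src):
                for a in group:
                    out[dst][a] = rows[s][a]
    return out


def class_cond_dependence(rows, labels, a, b):
    """Stand-in statistic: sum over classes of |cov(x_a, x_b)| within the class."""
    total = 0.0
    for c in sorted(set(labels)):
        sub = [r for r, l in zip(rows, labels) if l == c]
        n = len(sub)
        sa = sum(r[a] for r in sub)
        sb = sum(r[b] for r in sub)
        sab = sum(r[a] * r[b] for r in sub)
        total += abs(sab / n - (sa / n) * (sb / n))
    return total


def empirical_p_value(rows, labels, partition, a, b, n_perm=99, seed=0):
    """Share of surrogates whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = class_cond_dependence(rows, labels, a, b)
    hits = 0
    for _ in range(n_perm):
        surrogate = permute_within_class(rows, labels, partition, rng)
        if class_cond_dependence(surrogate, labels, a, b) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


# Synthetic data (an assumption for illustration): the class is the XOR of
# attributes 0 and 1, and attribute 2 is independent noise.
data_rng = random.Random(1)
rows = [[data_rng.randint(0, 1) for _ in range(3)] for _ in range(200)]
labels = [r[0] ^ r[1] for r in rows]

p_singletons = empirical_p_value(rows, labels, [[0], [1], [2]], 0, 1)
p_grouped = empirical_p_value(rows, labels, [[0, 1], [2]], 0, 1)
```

In this toy setting, the all-singletons partition breaks the dependence between attributes 0 and 1, so the observed statistic is extreme relative to the surrogates and `p_singletons` is small. The partition that keeps attributes 0 and 1 together preserves their joint distribution within each class, so the surrogates are indistinguishable from the data under this statistic and that partition is not rejected.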
Taxonomy
Topics: Imbalanced Data Classification Techniques · Machine Learning and Data Classification · Data Mining Algorithms and Applications
