TL;DR
This paper introduces a new non-parametric estimator for reliable correlation measurement in categorical data and provides an efficient algorithmic framework for discovering top correlated attribute sets, validated through empirical case studies.
Contribution
It proposes a corrected, consistent estimator for normalized total correlation and an effective search framework for top-k correlated sets in categorical data.
Findings
The proposed estimator achieves low regret even with small samples
The search algorithms are effective on large, high-dimensional data
The framework identifies meaningful correlations in the case studies
Abstract
In many scientific tasks we are interested in discovering whether there exist any correlations in our data. This raises many questions, such as how to reliably and interpretably measure correlation between a multivariate set of attributes, how to do so without having to make assumptions about the distribution of the data or the type of correlation, and how to efficiently discover the top-most reliably correlated attribute sets from data. In this paper, we answer these questions for discovery tasks in categorical data. In particular, we propose a corrected-for-chance, consistent, and efficient estimator for normalized total correlation, by which we obtain a reliable, naturally interpretable, non-parametric measure for correlation over multivariate sets. For the discovery of the top-k correlated sets, we derive an effective algorithmic framework based on a tight bounding function. This…
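To make the quantity in the abstract concrete, the following is a minimal Python sketch of a plug-in estimate of total correlation for categorical columns, together with one possible normalization and a generic permutation baseline as a stand-in for chance correction. It is not the paper's estimator: the function names, the normalizer (sum of marginal entropies minus the largest one, which upper-bounds total correlation), and the permutation scheme are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_correlation(columns):
    """Plug-in total correlation: sum of marginal entropies minus joint entropy.

    `columns` is a list of equal-length lists, one per categorical attribute.
    Returns the estimate together with the marginal entropies.
    """
    marginals = [entropy(col) for col in columns]
    joint = entropy(list(zip(*columns)))
    return sum(marginals) - joint, marginals


def normalized_tc(columns):
    """Total correlation scaled to [0, 1].

    Assumption: normalize by sum(H_i) - max(H_i), an upper bound on total
    correlation because the joint entropy is at least the largest marginal.
    The paper's normalization may differ.
    """
    tc, marginals = total_correlation(columns)
    bound = sum(marginals) - max(marginals)
    return tc / bound if bound > 0 else 0.0


def chance_corrected_ntc(columns, n_perm=200, seed=0):
    """Subtract the average score obtained under independently shuffled columns.

    Assumption: a Monte Carlo permutation baseline, used here only as a
    generic illustration of correcting a plug-in estimate for chance.
    """
    rng = random.Random(seed)
    observed = normalized_tc(columns)
    baseline = 0.0
    for _ in range(n_perm):
        # Shuffling all but the first column breaks any dependence between them.
        shuffled = [columns[0]] + [rng.sample(col, len(col)) for col in columns[1:]]
        baseline += normalized_tc(shuffled)
    return observed - baseline / n_perm


if __name__ == "__main__":
    a = [0, 1] * 20
    b = list(a)                # exact copy of a
    c = [0, 0, 1, 1] * 10      # empirically independent of a
    print(normalized_tc([a, b]), normalized_tc([a, c]))
    print(chance_corrected_ntc([a, b]), chance_corrected_ntc([a, c]))
```

The uncorrected plug-in score tends to be inflated on small samples because random co-occurrences look like dependence; subtracting a baseline measured under shuffled data is one simple way to discount that effect.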
