A Probabilistic Model for Data Redundancy in the Feature Domain
Ghurumuruhan Ganesan

TL;DR
This paper introduces a probabilistic model to estimate the number of uncorrelated features in a large dataset. The model accounts for both pairwise collinearity and multicollinearity, and it yields upper and lower bounds on the size of a feature set with low collinearity and low multicollinearity.
Contribution
It presents a probabilistic approach to counting uncorrelated features when features may depend on each other both pairwise and in larger groups, together with matching-order theoretical bounds and an auxiliary result on mutually good constrained sets.
Findings
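Provides upper and lower bounds of the same order on the size of a feature set with low collinearity and low multicollinearity
Models both pairwise correlation (collinearity) and interdependency of multiple features (multicollinearity)
Includes an auxiliary result on mutually good constrained sets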
Abstract
In this paper, we use a probabilistic model to estimate the number of uncorrelated features in a large dataset. Our model allows for both pairwise feature correlation (collinearity) and interdependency of multiple features (multicollinearity). We use the probabilistic method to obtain upper and lower bounds of the same order for the size of a feature set that exhibits low collinearity and low multicollinearity. We also prove an auxiliary result regarding mutually good constrained sets that is of independent interest.
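To make the distinction between collinearity and multicollinearity concrete, the following is a minimal numerical sketch. It is not the paper's probabilistic model or proof technique. It only illustrates, on synthetic data, that a feature set can have moderate pairwise correlations while one feature is almost entirely explained by the others. The helper names max_abs_pairwise_corr and max_r_squared, the use of R^2 as a multicollinearity measure, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def max_abs_pairwise_corr(X):
    """Largest absolute Pearson correlation between two distinct columns of X."""
    corr = np.corrcoef(X, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.max(np.abs(off_diag)))

def max_r_squared(X):
    """Largest R^2 obtained by regressing one column on all the other columns."""
    Xc = X - X.mean(axis=0)  # center so no intercept term is needed
    best = 0.0
    for j in range(Xc.shape[1]):
        y = Xc[:, j]
        others = np.delete(Xc, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / (y @ y)
        best = max(best, float(r2))
    return best

# Synthetic data (hypothetical): three independent features, plus a fourth
# that is nearly a linear combination of all three.
rng = np.random.default_rng(0)
n = 2000
a, b, c = rng.standard_normal((3, n))
d = (a + b + c) / np.sqrt(3) + 0.05 * rng.standard_normal(n)

independent = np.column_stack([a, b, c])
redundant = np.column_stack([a, b, c, d])

for name, X in [("independent", independent), ("redundant", redundant)]:
    print(name, max_abs_pairwise_corr(X), max_r_squared(X))
```

In this sketch, the fourth feature correlates only moderately with each of the other three individually, yet regressing it on all three explains nearly all of its variance. This is the kind of multi-feature dependence that a pairwise check alone would miss, and it is why a count of "uncorrelated" features needs to constrain both effects.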
Taxonomy
Topics: Advanced Statistical Methods and Models
