A Probabilistic Model for Data Redundancy in the Feature Domain

Ghurumuruhan Ganesan

arXiv:2309.13657·cs.LG·September 26, 2023

TL;DR

This paper introduces a probabilistic model for estimating the number of uncorrelated features in a large dataset. The model accounts for both pairwise collinearity and multicollinearity, and it yields bounds on the size of a feature set with low collinearity and low multicollinearity.

Contribution

It presents a probabilistic approach to counting uncorrelated features when features may depend on one another pairwise or in larger groups, with upper and lower bounds of the same order and an auxiliary result on mutually good constrained sets.

Findings

01 Provides upper and lower bounds of the same order for the size of a feature set with low collinearity and low multicollinearity

02 Models both pairwise collinearity and multicollinearity

03 Includes an auxiliary result on mutually good constrained sets

Abstract

In this paper, we use a probabilistic model to estimate the number of uncorrelated features in a large dataset. Our model allows for both pairwise feature correlation (collinearity) and interdependency of multiple features (multicollinearity), and we use the probabilistic method to obtain upper and lower bounds of the same order for the size of a feature set that exhibits low collinearity and low multicollinearity. We also prove an auxiliary result regarding mutually good constrained sets that is of independent interest.
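
To make the two notions in the abstract concrete, the following is a minimal illustrative sketch and not the paper's method: the paper derives bounds with the probabilistic method, whereas this snippet greedily builds one feature set on synthetic data. The function names `r_squared` and `greedy_select`, the thresholds `pair_thresh` and `multi_thresh`, and the use of absolute Pearson correlation and regression R^2 as collinearity measures are all assumptions made for illustration.

```python
import numpy as np

def r_squared(y, Z):
    """R^2 of a least-squares regression of y on the columns of Z (with intercept)."""
    A = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

def greedy_select(X, pair_thresh=0.8, multi_thresh=0.9):
    """Greedily keep columns with low pairwise and low multi-feature dependence.

    Hypothetical helper: a column is kept only if its absolute correlation with
    every kept column is below pair_thresh (low collinearity) and its R^2 when
    regressed on all kept columns is below multi_thresh (low multicollinearity).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in range(X.shape[1]):
        if any(corr[j, k] >= pair_thresh for k in selected):
            continue  # rejected: pairwise collinear with a kept feature
        if selected and r_squared(X[:, j], X[:, selected]) >= multi_thresh:
            continue  # rejected: nearly a linear combination of kept features
        selected.append(j)
    return selected

# Synthetic data (assumed for illustration): three independent features,
# a near-copy of feature 0, and a near-exact sum of features 0 and 1.
rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=(n, 3))
near_copy = base[:, 0] + 0.01 * rng.normal(size=n)
near_sum = base[:, 0] + base[:, 1] + 0.01 * rng.normal(size=n)
X = np.column_stack([base, near_copy, near_sum])

selected = greedy_select(X)
print(selected)
```

In this setup the near-copy is caught by the pairwise check alone, while the near-sum has a correlation of only about 0.7 with each of its two parents and is caught by the regression check instead. That gap is the reason the abstract treats collinearity and multicollinearity as separate constraints.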

Taxonomy

Topics: Advanced Statistical Methods and Models