Statistical properties of large data sets with linear latent features
Philipp Fleig, Ilya Nemenman

TL;DR
This paper analytically investigates how low-dimensional latent features manifest in large high-dimensional datasets, revealing their statistical signatures in correlations and eigenvalues across various data regimes.
Contribution
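The paper introduces a linear latent feature model with additive noise, built from probabilistic matrices. It computes, both analytically and numerically, the distributions of pairwise correlations and of correlation-matrix eigenvalues, and it gives an analytic estimate of the boundary between signal and noise.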
Findings
Latent features leave characteristic signatures in correlation distributions.
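An analytic estimate locates the boundary between signal and noise even when the eigenvalue spectrum has no clear gap.
The latent feature structure can be resolved across a wide range of data regimes, set by the number of recorded variables, observations, and latent features and by the signal-to-noise ratio.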
Abstract
Analytical understanding of how low-dimensional latent features reveal themselves in large-dimensional data is still lacking. We study this by defining a linear latent feature model with additive noise constructed from probabilistic matrices, and analytically and numerically computing the statistical distributions of pairwise correlations and eigenvalues of the correlation matrix. This allows us to resolve the latent feature structure across a wide range of data regimes set by the number of recorded variables, observations, latent features and the signal-to-noise ratio. We find a characteristic imprint of latent features in the distribution of correlations and eigenvalues and provide an analytic estimate for the boundary between signal and noise even in the absence of a clear spectral gap.
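The setup described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes Gaussian i.i.d. entries for the latent and loading matrices, a particular normalization of the signal, and a definition of the signal-to-noise ratio as a ratio of per-entry variances, none of which is specified on this page. The function names (`simulate_latent_data`, `correlation_statistics`, `pure_noise_upper_edge`) are hypothetical. As a reference point, it uses the standard Marchenko-Pastur upper edge for a pure-noise correlation matrix, which is a textbook random-matrix result and not the paper's analytic signal-noise boundary.

```python
import numpy as np


def simulate_latent_data(n_obs, n_vars, n_latent, snr, rng):
    """Draw data X = sqrt(snr) * U V / sqrt(n_latent) + noise.

    Assumption: U, V, and the noise all have i.i.d. standard normal entries.
    The 1/sqrt(n_latent) factor gives each signal entry unit variance on
    average, so `snr` is the ratio of signal variance to noise variance.
    """
    U = rng.standard_normal((n_obs, n_latent))   # latent features per observation
    V = rng.standard_normal((n_latent, n_vars))  # loadings of variables on features
    noise = rng.standard_normal((n_obs, n_vars))
    return np.sqrt(snr) * (U @ V) / np.sqrt(n_latent) + noise


def correlation_statistics(X):
    """Return the correlation matrix, its eigenvalues, and its off-diagonal entries."""
    C = np.corrcoef(X, rowvar=False)             # variables are columns
    eigvals = np.linalg.eigvalsh(C)              # ascending, real (C is symmetric)
    offdiag = C[np.triu_indices_from(C, k=1)]    # each pairwise correlation once
    return C, eigvals, offdiag


def pure_noise_upper_edge(n_obs, n_vars):
    """Marchenko-Pastur upper edge for uncorrelated unit-variance data."""
    return (1.0 + np.sqrt(n_vars / n_obs)) ** 2


rng = np.random.default_rng(0)
n_obs, n_vars, n_latent, snr = 2000, 200, 5, 1.0  # illustrative values only

X = simulate_latent_data(n_obs, n_vars, n_latent, snr, rng)
C, eigvals, offdiag = correlation_statistics(X)

edge = pure_noise_upper_edge(n_obs, n_vars)
n_above = int(np.sum(eigvals > edge))

print(f"pure-noise upper edge: {edge:.3f}")
print(f"eigenvalues above the edge: {n_above}")
print(f"std of pairwise correlations: {offdiag.std():.3f}")
```

With these illustrative parameters the signal is strong and the eigenvalues associated with the latent features sit far above the pure-noise edge. Lowering `snr` or raising `n_latent` moves them toward the bulk, which is the regime without a clear spectral gap that the paper's analytic estimate is meant to address.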
