Unsupervised detection of semantic correlations in big data
Santiago Acevedo, Alex Rodriguez, Alessandro Laio

TL;DR
This paper introduces a method to detect semantic correlations in large, high-dimensional binary data by estimating the intrinsic dimension, enabling analysis of complex data structures and neural network representations.
Contribution
It presents a novel algorithm for estimating the intrinsic dimension of big data, which is robust to high dimensionality and reveals semantic correlations in images and text.
Findings
Identified phase transitions in model magnetic systems.
Detected semantic correlations in neural network representations.
Demonstrated robustness of the method in big data scenarios.
Abstract
In real-world data, information is stored in extremely large feature vectors. These variables are typically correlated due to complex interactions involving many features simultaneously. Such correlations qualitatively correspond to semantic roles and are naturally recognized by both the human brain and artificial neural networks. This recognition enables, for instance, the prediction of missing parts of an image or text based on their context. We present a method to detect these correlations in high-dimensional data represented as binary numbers. We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data, and is therefore a proxy of semantic complexity. The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis. We test…
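To make concrete the idea of "the minimum number of independent coordinates" for binary data, here is a minimal sketch. It is not the algorithm proposed in the paper, which this excerpt does not specify. It swaps in a simple moment-matching heuristic under a strong assumption: if the data consist of d independent, unbiased bits (possibly duplicated across many features), the Hamming distance between two points is a rescaled Binomial(d, 1/2) variable, so the squared mean of the distances divided by their variance equals d. The function names `pairwise_hamming` and `moment_dimension_estimate` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def pairwise_hamming(X):
    """Hamming distances between all distinct pairs of rows of a 0/1 array."""
    X = np.asarray(X, dtype=np.int64)
    # Count positions where row i is 1 and row j is 0, plus the reverse case.
    D = X @ (1 - X).T + (1 - X) @ X.T
    upper = np.triu_indices(X.shape[0], k=1)  # each unordered pair once
    return D[upper]


def moment_dimension_estimate(X):
    """Illustrative heuristic (an assumption, not the paper's estimator).

    For d independent unbiased bits, each copied k times, the Hamming
    distance has mean k*d/2 and variance k^2*d/4, so mean^2/variance = d.
    """
    r = pairwise_hamming(X).astype(float)
    return r.mean() ** 2 / r.var()


# Toy data: 20 independent bits, each duplicated 5 times -> 100 features.
rng = np.random.default_rng(0)
latent = rng.integers(0, 2, size=(400, 20))
data = np.repeat(latent, 5, axis=1)

print("number of features:", data.shape[1])
print("estimated independent bits:", round(moment_dimension_estimate(data), 1))
```

On this toy dataset the estimate should land close to the 20 latent bits rather than the 100 observed features, which is the sense in which correlated features lower the effective dimension. Real data with biased or nonlinearly coupled bits violates the assumption above, so the sketch should be read as intuition only.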
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Mining Algorithms and Applications
