Unsupervised detection of semantic correlations in big data

Santiago Acevedo; Alex Rodriguez; Alessandro Laio

arXiv:2411.02126 · cs.LG · May 22, 2025

TL;DR

This paper introduces a method to detect semantic correlations in large, high-dimensional binary data by estimating the binary intrinsic dimension: the minimum number of independent coordinates needed to describe the data. The estimate can be applied to complex data structures and to neural network representations.

Contribution

It presents an algorithm for estimating the binary intrinsic dimension of large datasets. The algorithm is largely insensitive to the curse of dimensionality and reveals semantic correlations in images and text.

Findings

01 Identified phase transitions in model magnetic systems.

02 Detected semantic correlations in neural network representations.

03 Demonstrated robustness of the method in big data scenarios.

Abstract

In real-world data, information is stored in extremely large feature vectors. These variables are typically correlated due to complex interactions involving many features simultaneously. Such correlations qualitatively correspond to semantic roles and are naturally recognized by both the human brain and artificial neural networks. This recognition enables, for instance, the prediction of missing parts of an image or text based on their context. We present a method to detect these correlations in high-dimensional data represented as binary numbers. We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data, and is therefore a proxy of semantic complexity. The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis. We test…
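To make the idea of a binary intrinsic dimension more concrete, here is a minimal sketch based on a simple moment-matching heuristic. It is not the authors' algorithm, and the function name `hamming_moment_dimension` is hypothetical. The assumption is that for d independent, unbiased bits the Hamming distance between two random points follows a Binomial(d, 1/2) distribution, with mean d/2 and variance d/4, so the ratio mean²/variance gives a rough count of independent coordinates.

```python
import numpy as np


def hamming_moment_dimension(X, n_pairs=20000, seed=0):
    """Rough effective dimension of binary data X (n_samples x n_features).

    Illustrative heuristic only (an assumption, not the paper's estimator):
    matches the first two moments of sampled pairwise Hamming distances
    to a Binomial(d, 1/2) model, for which mean**2 / variance == d.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Sample random pairs of rows and discard pairs of identical indices.
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j
    # Hamming distance = number of features in which the two rows differ.
    r = np.count_nonzero(X[i[keep]] != X[j[keep]], axis=1)
    return r.mean() ** 2 / r.var()


# Toy data: 50 independent bits, then the same bits copied 4 times each
# (200 features, but still only 50 independent coordinates).
rng = np.random.default_rng(1)
independent = rng.integers(0, 2, size=(2000, 50), dtype=np.uint8)
redundant = np.repeat(independent, 4, axis=1)

print(hamming_moment_dimension(independent))
print(hamming_moment_dimension(redundant))
```

In this toy setting both estimates should come out close to 50, even though the second dataset has 200 features: copying bits adds features without adding independent coordinates. Real data with more complex, many-body correlations would need the estimator described in the paper rather than this two-moment shortcut.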

Code & Models

Repositories

acevedo-s/bid (official)

Taxonomy

Topics: Advanced Clustering Algorithms Research · Data Mining Algorithms and Applications