Intrinsic dimension estimation for discrete metrics
Iuri Macocco, Aldo Glielmo, Jacopo Grilli, Alessandro Laio

TL;DR
This paper introduces a new algorithm to estimate the intrinsic dimension of datasets in discrete spaces, addressing limitations of existing methods designed for continuous data, and demonstrates its effectiveness on various datasets including biological data.
Contribution
The paper presents an intrinsic dimension estimation algorithm designed specifically for discrete metric spaces, a setting that existing dimensionality reduction techniques, built for continuous spaces, handle poorly.
Findings
Accurate ID estimation on benchmark datasets
Application to metagenomic data reveals low-dimensional structure
Evolutionary pressure may act on a low-dimensional manifold
Abstract
Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensionality reduction methods are designed for continuous spaces, and their use on discrete spaces can lead to errors and biases. In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
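The abstract states that continuous-space methods can be biased on discrete data without saying why. One way to see it, offered here as an illustrative sketch and not as the paper's algorithm, is to count the points of the integer lattice Z^d inside an L1 ball and compare the growth rate with the power law r^d that continuous estimators assume. The function names `l1_ball_size` and `apparent_dimension` are hypothetical, and the choice of the L1 (Manhattan) metric on a regular lattice is an assumption made for the example.

```python
from math import comb, log


def l1_ball_size(radius: int, dim: int) -> int:
    """Count integer lattice points x in Z^dim with sum(|x_i|) <= radius.

    Choose i nonzero coordinates (comb(dim, i)), a sign for each (2**i),
    and positive magnitudes summing to at most radius (comb(radius, i)).
    """
    return sum(
        2**i * comb(dim, i) * comb(radius, i)
        for i in range(min(dim, radius) + 1)
    )


def apparent_dimension(r1: int, r2: int, dim: int) -> float:
    """Log-log slope of ball size between two radii.

    This is the dimension a continuous-space argument (volume ~ r^d)
    would infer from the two neighbor counts.
    """
    ratio = l1_ball_size(r2, dim) / l1_ball_size(r1, dim)
    return log(ratio) / log(r2 / r1)


if __name__ == "__main__":
    true_dim = 2
    # Small radii: lattice effects dominate and the slope is far from 2.
    print(apparent_dimension(1, 2, true_dim))
    # Large radii: the lattice looks continuous and the slope approaches 2.
    print(apparent_dimension(100, 200, true_dim))
```

On a two-dimensional lattice the slope measured between radii 1 and 2 comes out well below 2, while between radii 100 and 200 it is close to 2. The power-law assumption therefore fails exactly at the small scales where neighbor-based estimators usually operate, which is the kind of bias a discrete-aware estimator has to correct.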
Taxonomy
Topics: Gene expression and cancer classification · Bioinformatics and Genomic Networks · Genomics and Phylogenetic Studies
