A comparison of different clustering approaches for high-dimensional presence-absence data
Gabriele d'Angella, Christian Hennig

TL;DR
This paper compares various clustering methods for high-dimensional presence-absence data, evaluating their performance through extensive simulations based on species distribution models.
Contribution
It provides a comprehensive comparison of latent class, hierarchical, and multidimensional scaling clustering approaches for presence-absence data.
Findings
Latent class clustering performs well with certain data structures.
Distance-based methods are computationally efficient.
Multidimensional scaling approaches offer a useful alternative.
Abstract
Presence-absence data are vectors or matrices of zeroes and ones, where a one usually indicates a "presence" in a certain place. Such data occur, for example, when investigating geographical species distributions, genetic information, or the occurrence of certain terms in texts. There are many applications for clustering such data; one example is to find so-called biotic elements, i.e., groups of species that tend to occur together geographically. Presence-absence data can be clustered in various ways: using a latent class mixture approach with local independence, using distance-based hierarchical clustering with the Jaccard distance, or using clustering methods for continuous data on a multidimensional scaling representation of the distances. These methods are conceptually very different and therefore cannot easily be compared theoretically. We compare…
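As an illustration of the distance-based route mentioned in the abstract, the following is a minimal sketch of the Jaccard distance between 0/1 vectors, followed by a naive agglomerative clustering on those distances. It is not the authors' implementation: the function names, the toy data, the choice of average linkage, and the convention that two all-zero vectors have distance 0 are assumptions made here for illustration only.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two 0/1 vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # joint presences
    either = sum(1 for x, y in zip(a, b) if x or y)   # presences in at least one
    # Assumed convention: two all-zero vectors are at distance 0.
    return 0.0 if either == 0 else 1.0 - both / either


def average_linkage(rows, n_clusters):
    """Naive agglomerative clustering (average linkage, an assumed choice)."""
    n = len(rows)
    # Pairwise Jaccard distances, keyed by (i, j) with i < j.
    dist = {(i, j): jaccard_distance(rows[i], rows[j])
            for i, j in combinations(range(n), 2)}

    def link(c1, c2):
        # Mean distance over all cross-cluster pairs.
        total = sum(dist[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical toy data: 4 species (rows) observed in 6 regions (columns).
species = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
groups = average_linkage(species, n_clusters=2)
print(groups)
```

In this toy example the first two species share most of their regions and the last two share theirs, so the two groups recovered correspond to two candidate biotic elements. For realistic data sizes a library routine would be preferable to this quadratic-memory sketch.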
Taxonomy
Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research
