# Estimating species commonness and prevalence through unsupervised methods

**Authors:** Pasquale Bove, Andrea Bertini, Gianpaolo Coro

PMC · DOI: 10.1038/s41598-026-38900-1 · 2026-02-11

## TL;DR

This paper introduces an unsupervised method for estimating how common each species is in a given area from occurrence records in the Global Biodiversity Information Facility, giving ecological niche models objective prevalence values.

## Contribution

A novel, data-driven, unsupervised, multi-species methodology for estimating the species prevalence values used in ecological niche models.

## Key findings

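- All models achieved high accuracy against expert-based assessments; the deep-learning model achieved the highest (~81–90%) in classifying species prevalence.
- The methodology is scalable and reproducible, combining two clustering methods, a deep-learning model, an ensemble model, and statistical analysis to turn commonness classes into prevalence probabilities.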
- The approach was validated in a case study of 161 species in the Massaciuccoli Lake basin.

## Abstract

The prevalence of a species in a given area is crucial for estimating the environmental conditions associated with its subsistence within ecological niche models (ENMs). Prevalence is defined as the proportion of presences relative to the total number of sampled sites, reflecting prior expectation on species commonness or rarity. However, reliable estimation often faces challenges due to limited or biased occurrence data, particularly for rare or poorly monitored species. This work presents a data-driven, multi-species methodology to estimate species prevalence for use in ENMs. It leverages species occurrence records from the Global Biodiversity Information Facility and is entirely unsupervised. It utilises two clustering methods, one deep-learning model, and an ensemble model, plus statistical analysis to classify species commonness and transform classifications into prevalence probabilities. A case study is presented for 161 species living in the Massaciuccoli Lake basin (Tuscany, Italy), a wetland of high biodiversity value and ecological sensitivity. The models classified the species’ prevalence based on observations from other Italian wetland sites, and were evaluated against expert-based assessments. All models achieved high accuracy, with the deep-learning model achieving the highest (~81–90%). The proposed methodology is scalable and reproducible and can inform ENMs with objective, robust prevalence estimates.
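The definition of prevalence given in the abstract (presences divided by the total number of sampled sites) can be illustrated with a minimal Python sketch. This is not the paper's unsupervised pipeline: the function name `prevalence`, the `surveys` dictionary, and the presence/absence values are all hypothetical and exist only to show the arithmetic.

```python
from typing import Mapping, Sequence


def prevalence(site_records: Mapping[str, Sequence[bool]]) -> dict[str, float]:
    """Return, per species, the fraction of sampled sites with a presence.

    Hypothetical helper: it implements only the textbook definition of
    prevalence, not the clustering or deep-learning models of the paper.
    """
    result = {}
    for species, presences in site_records.items():
        if not presences:
            raise ValueError(f"no sampled sites for {species!r}")
        # True counts as 1, so the sum is the number of sites with a presence.
        result[species] = sum(presences) / len(presences)
    return result


# Made-up presence/absence flags for four sampled sites (illustration only).
surveys = {
    "Cyprinus carpio": [True, True, True, False],
    "Plegadis falcinellus": [False, True, False, False],
}
estimates = prevalence(surveys)
```

With this toy input, a species recorded at three of four sites gets a prevalence of 0.75 and one recorded at a single site gets 0.25. The paper's contribution is to estimate such values when reliable presence/absence surveys are unavailable.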

## Full-text entities

- **Species:** Tarentola mauritanica (common wall gecko, species) [taxon 8569], Olea europaea (common olive, species) [taxon 4146], Micropterus salmoides (largemouth bass, species) [taxon 27706], Halyomorpha halys (brown marmorated stink bug, species) [taxon 286706], Acrocephalus arundinaceus (great reed warbler, species) [taxon 39621], Ischnura elegans (species) [taxon 197161], Ameiurus melas (black bullhead, species) [taxon 219545], Plegadis falcinellus (species) [taxon 52788], Cyprinus carpio (carp, species) [taxon 7962], Tinca tinca (tench, species) [taxon 27717], Procambarus clarkii (red swamp crayfish, species) [taxon 6728], Homo sapiens (human, species) [taxon 9606], Acrocephalus melanopogon (moustached warbler, species) [taxon 68470], Podarcis muralis (common wall lizard, species) [taxon 64176], Haematopus ostralegus (Eurasian oystercatcher, species) [taxon 31908], Ichthyaetus melanocephalus (Mediterranean gull, species) [taxon 1288290], Acrocephalus schoenobaenus (sedge warbler, species) [taxon 52609]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12966340/full.md

---
Source: https://tomesphere.com/paper/PMC12966340