# Topological stratification of continuous genetic variation in large biobanks

**Authors:** Alex Diaz-Papkovich, Shadi Zabad, Hannah Snell, Chief Ben-Eghan, Luke Anderson-Trocmé, Georgette Femerling, Vikram Nathan, Jenisha Patel, Simon Gravel

PMC · DOI: 10.1371/journal.pgen.1012068 · PLOS Genetics · 2026-03-16

## TL;DR

This paper proposes clustering biobank-scale genetic data with UMAP followed by HDBSCAN, producing clusters that track population structure and ancestry-correlated features without strong assumptions about the number or size of clusters.

## Contribution

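A topological clustering approach that combines dimensionality reduction (UMAP) with density clustering (HDBSCAN) to group individuals into relatively dense subsets of genotype space, including clusters that span long continuous distances, without external reference panels or a fixed number of clusters.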

## Key findings

- The method identifies clusters enriched for shared ancestry-related features like birth country and ethnicity.
- It distinguishes admixed populations by capturing continuous genetic variation across large distances in genotype space.
- The resulting clusters help characterize structure within biobanks and support downstream analyses, including examining polygenic score transferability, identifying potentially influential alleles, and quality control.
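The second finding above, that clusters can span long distances in genotype space, can be illustrated with a toy example. The sketch below is not the paper's method: it uses fixed-radius connectivity (a deliberately simplified stand-in for density-based clustering, not HDBSCAN) on synthetic 2D coordinates, and compares it with assignment to the nearest population centroid, a rough analogue of assuming a "type" genome. All function names, coordinates, and the radius are hypothetical choices for illustration.

```python
import numpy as np


def connected_components(points, radius):
    """Label points by connectivity: two points share a label if a chain of
    neighbours, each within `radius` of the next, links them."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adjacent = dists <= radius
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            # Visit unlabelled neighbours of point i.
            for j in np.flatnonzero(adjacent[i] & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels


# Synthetic "population 0": a long continuous cline, as might arise from
# wide variation in ancestry proportions (an assumption for illustration).
cline = np.column_stack([np.linspace(0.0, 10.0, 101), np.zeros(101)])

# Synthetic "population 1": a compact group near one end of the cline.
offsets = np.array([-0.2, 0.0, 0.2])
gx, gy = np.meshgrid(12.0 + offsets, offsets)
blob = np.column_stack([gx.ravel(), gy.ravel()])

points = np.vstack([cline, blob])
truth = np.concatenate([np.zeros(len(cline), dtype=int),
                        np.ones(len(blob), dtype=int)])

# Centroid-based assignment: each point goes to the nearest population mean.
centroids = np.vstack([points[truth == k].mean(axis=0) for k in (0, 1)])
centroid_dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
centroid_labels = centroid_dists.argmin(axis=1)

# Connectivity-based assignment: follows the cline end to end.
component_labels = connected_components(points, radius=0.5)

n_misassigned = int((centroid_labels != truth).sum())
print("misassigned by nearest centroid:", n_misassigned)
print("connectivity recovers both groups:", np.array_equal(component_labels, truth))
```

In this toy setting, the far end of the cline lies closer to the other group's centroid than to its own, so centroid assignment splits the cline, while connectivity keeps it whole because neighbouring points form an unbroken chain.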

## Abstract

Biobanks now contain genetic data from millions of individuals. Dimensionality reduction, visualization and clustering are standard when exploring data at these scales; while efficient and tractable methods exist for the first two, clustering remains challenging because of the many ways in which demography and sampling can affect structure. In practice, clustering is commonly performed by drawing shapes around dimensionally reduced data or assuming populations have “type” genomes or allele frequencies that represent a population. We propose to use dimensionality reduction with UMAP followed by clustering with HDBSCAN to identify sets of points forming relatively dense subsets in genotype space. The approach is fast, easy to implement, and integrates with existing pipelines. When applied to simulated data or data from three biobanks, the approach identifies groups of individuals enriched for shared features correlated with ancestry, including country of birth, ethnicity, and sampling location, without requiring strong assumptions about the number or size of clusters, or the sources of population structure. Because it does not rely on proximity to a specific point in genetic space, this topological approach can form clusters that continuously span long distances in genetic space. This can help distinguish admixed populations, which can exhibit wide ancestry variation within populations and overlap of ancestry proportions across populations. Such clusters can highlight and account for interpretable sources of genetic, demographic, or sampling heterogeneity in a dataset that would otherwise have required a range of specialized techniques. We illustrate how topological genetic strata can further help us understand structure within biobanks, evaluate distributions of genotypic and phenotypic data, examine polygenic score transferability, identify potential influential alleles, and perform quality control.

Clustering is a common approach to study large-scale genomic datasets. It can be used to investigate the relationships between genetics, demography, and biomedical/environmental variables, and to define subgroups for downstream study. Identifying clusters in large biobanks presents challenges due to their large size, high dimensionality, and the complexity of the demographic and sampling processes that shape observed genetic diversity. We present a computationally tractable method of clustering using dimensionality reduction (UMAP) and density clustering (HDBSCAN) that captures relatedness patterns and reflects important characteristics of a biobank’s data, even where there are subpopulations of widely varying sizes. This method does not depend on external reference panels, and the clusters can be characterized post hoc using ancestry and ancestry-associated data without using reductive labels. We carry out clustering on simulated data as well as three biobanks and present a series of vignettes showing how clustering helped us identify patterns due to demographic histories and sampling strategy, with an impact on phenotype distributions, genetic risk score accuracy, and data quality control.

## Full-text entities

- **Genes:** PC (pyruvate carboxylase) [NCBI Gene 5091] {aka PCB}, HLA-A (major histocompatibility complex, class I, A) [NCBI Gene 3105] {aka HLAA}, APOE (apolipoprotein E) [NCBI Gene 348] {aka AD2, APO-E, ApoE4, LDLCQ5, LPG}
- **Diseases:** PC (MESH:D015324), IBD (MESH:D009105), CHS (MESH:D002609)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Mutations:** rs4420638, rs7412

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13008251/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13008251/full.md

## References

74 references — full list in the complete paper: https://tomesphere.com/paper/PMC13008251/full.md

---
Source: https://tomesphere.com/paper/PMC13008251