# skDER and CiDDER: two scalable approaches for microbial genome dereplication

**Authors:** Rauf Salamzade, Aamuktha Kottapalli, Lindsay R. Kalan

PMC · DOI: 10.1099/mgen.0.001438 · Microbial Genomics · 2025-07-10

## TL;DR

The paper introduces two tools, skDER and CiDDER, that select a subset of representative microbial genomes for comparative genomic studies, reducing both the computational burden and the bias caused by over-represented lineages.

## Contribution

The novel contribution is a pair of scalable dereplication tools: skDER, which selects representative genomes using average nucleotide identity (ANI), and CiDDER, which selects them by assessing the saturation of distinct protein-coding genes.

## Key findings

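- skDER can dereplicate thousands of microbial genomes; in benchmarks against other ANI-based dereplication tools, its default mode was efficient, achieved comparable pangenome coverage, and strictly adhered to user-defined cutoffs for both ANI and aligned fraction (AF).
- CiDDER is an alternative to ANI-based dereplication that selects representatives based on saturation of distinct protein-coding genes, letting users more directly optimize for broad coverage of a taxon's pangenome.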
- Both tools include auxiliary functionalities like automated genome downloading and filtering of plasmids and phages.
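
The idea of selecting genomes until the set of distinct protein-coding genes saturates can be illustrated with a generic greedy coverage heuristic. The sketch below is not CiDDER's implementation: the function name `select_by_saturation`, the `target_fraction` parameter, and the input format (a mapping from genome name to a set of gene-cluster identifiers) are all assumptions made for illustration.

```python
def select_by_saturation(genome_to_clusters, target_fraction=0.9):
    """Greedily pick genomes until a target share of gene clusters is covered.

    genome_to_clusters: dict mapping genome name -> set of gene-cluster IDs
                        (hypothetical input format).
    target_fraction:    share of all distinct clusters to cover (illustrative).
    Returns (selected genome names in pick order, fraction covered).
    """
    pangenome = set().union(*genome_to_clusters.values())
    if not pangenome:
        return [], 0.0

    covered = set()
    selected = []
    remaining = dict(genome_to_clusters)

    while remaining and len(covered) < target_fraction * len(pangenome):
        # Pick the genome adding the most uncovered clusters; iterating over
        # sorted names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda g: len(remaining[g] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining genome contributes anything new
        covered |= gain
        selected.append(best)

    return selected, len(covered) / len(pangenome)


if __name__ == "__main__":
    toy = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {1, 2}, "D": {6}}
    print(select_by_saturation(toy, target_fraction=0.8))
```

In this toy input, genome `C` is never selected because every cluster it carries is already contributed by `A`, which is the sense in which a gene-content criterion can skip genomes that add nothing new to the pangenome.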

## Abstract

An abundance of microbial genomes have been sequenced in the past two decades. For fundamental comparative genomic investigations, where the goal is to determine the major gain and loss events shaping the pangenome of a species or broader taxon, it is often unnecessary and computationally onerous to include all available genomes in studies. In addition, the over-representation of specific lineages due to sampling and sequencing bias can have undesired effects on evolutionary analyses. To assist users with genomic dereplication, we developed skDER and CiDDER (https://github.com/raufs/skDER) to select a subset of representative genomes for downstream comparative genomic investigations. skDER is a nucleotide-based genomic dereplication tool that can dereplicate thousands of microbial genomes leveraging recent advances in average nucleotide identity (ANI) inference. CiDDER dereplicates microbial genomes based on saturation assessment of distinct protein-coding genes. To support usability, auxiliary functionalities are incorporated for testing the number of representative genomes resulting from applying various clustering parameters, automated downloading of genomes belonging to a bacterial species or genus, clustering non-representative genomes to their closest representative genomes and filtering plasmids and phages prior to dereplication. From benchmarking against other ANI-based dereplication tools, skDER, when run in the default mode, was efficient and achieved comparable pangenome coverage and strictly adhered to user-defined cutoffs for both ANI and aligned fraction (AF). Further, we showcase that CiDDER is a convenient alternative to ANI-based dereplication that allows users to more directly optimize the selection of representative genomes to cover a large breadth of a taxon’s pangenome.
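
To make the role of the two cutoffs concrete, here is a minimal sketch of threshold-based dereplication in which a genome counts as redundant only if it meets both the ANI and the AF cutoff against an already chosen representative. It is a generic greedy scheme, not skDER's actual code: the function name, the precomputed `pairs` dictionary, the priority ordering of the input, the direction in which AF is measured, and the numeric cutoffs are all assumptions for illustration.

```python
def greedy_dereplicate(genomes_by_priority, pairs, ani_cutoff=99.0, af_cutoff=50.0):
    """Greedy dereplication using paired ANI and aligned-fraction cutoffs.

    genomes_by_priority: genome names, best candidate representative first
                         (the ordering criterion is left to the caller).
    pairs: dict mapping (query, reference) -> (ani_percent, af_percent),
           a hypothetical precomputed comparison table; missing pairs are
           treated as dissimilar.
    Returns (representatives, assignment of every genome to a representative).
    """
    representatives = []
    assignment = {}

    for genome in genomes_by_priority:
        matched = None
        for rep in representatives:
            ani, af = pairs.get((genome, rep), (0.0, 0.0))
            # Both cutoffs must hold; high ANI over a small aligned
            # fraction is not enough to call the genome redundant.
            if ani >= ani_cutoff and af >= af_cutoff:
                matched = rep
                break
        if matched is None:
            representatives.append(genome)
            assignment[genome] = genome
        else:
            assignment[genome] = matched

    return representatives, assignment


if __name__ == "__main__":
    order = ["g1", "g2", "g3"]
    table = {
        ("g2", "g1"): (99.5, 80.0),  # passes both cutoffs
        ("g3", "g1"): (99.6, 30.0),  # high ANI but low aligned fraction
    }
    print(greedy_dereplicate(order, table))
```

In the toy table, `g3` stays a representative despite its high ANI to `g1` because its aligned fraction falls below the cutoff. The returned `assignment` also mirrors the auxiliary step of clustering non-representative genomes to a representative, although this sketch assigns each genome to the first passing representative rather than the closest one.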

## Full-text entities

- **Chemicals:** granet (-), charcoal (MESH:D002606)
- **Species:** Eolophus roseicapilla (galah, species) [taxon 176039], Cutibacterium avidum (species) [taxon 33010], Enterococcus faecalis (species) [taxon 1351]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12245536/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12245536/full.md

## References

62 references — full list in the complete paper: https://tomesphere.com/paper/PMC12245536/full.md

---
Source: https://tomesphere.com/paper/PMC12245536