skDER and CiDDER: two scalable approaches for microbial genome dereplication
Rauf Salamzade, Aamuktha Kottapalli, Lindsay R. Kalan

TL;DR
The paper introduces two tools, skDER and CiDDER, to efficiently select representative microbial genomes for comparative studies, reducing computational burden and bias.
Contribution
The novel contribution is the development of skDER and CiDDER, scalable dereplication tools that use ANI and pangenome saturation to select representative genomes.
Findings
skDER efficiently dereplicates thousands of microbial genomes with high pangenome coverage and adherence to user-defined cutoffs.
CiDDER offers an alternative to ANI-based dereplication by optimizing genome selection based on pangenome saturation of protein-coding genes.
Both tools include auxiliary functionalities like automated genome downloading and filtering of plasmids and phages.
Abstract
An abundance of microbial genomes have been sequenced in the past two decades. For fundamental comparative genomic investigations, where the goal is to determine the major gain and loss events shaping the pangenome of a species or broader taxon, it is often unnecessary and computationally onerous to include all available genomes in studies. In addition, the over-representation of specific lineages due to sampling and sequencing bias can have undesired effects on evolutionary analyses. To assist users with genomic dereplication, we developed skDER and CiDDER (https://github.com/raufs/skDER) to select a subset of representative genomes for downstream comparative genomic investigations. skDER is a nucleotide-based genomic dereplication tool that can dereplicate thousands of microbial genomes leveraging recent advances in average nucleotide identity (ANI) inference. CiDDER dereplicates…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5- —http://dx.doi.org/10.13039/100000057 National Institute of General Medical Sciences
- —http://dx.doi.org/10.13039/100015691 Division of Microbiology and Infectious Diseases, National Institute of Allergy and Infectious Diseases
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Phylogenetic Studies · Probiotics and Fermented Foods · Bacteriophages and microbial interactions
Data Summary
skDER and CiDDER are provided as open-source software implemented in Python and C++ on Github (https://github.com/raufs/skDER) with version updates tracked on Zenodo, https://zenodo.org/records/13887710 [1]. Installation of the software is supported via both Bioconda [2] and Docker. Additional code and data for analyses presented in this manuscript can be found on Zenodo at https://zenodo.org/records/13891800 [3]. Pre-computed representative genomes selected by skDER (v1.0.7) for 18 common bacterial taxonomic groups, referencing classifications from Genome Taxonomy Database release 214 [4], are also provided on Zenodo at https://zenodo.org/records/10041203 [5].
Introduction
The boundaries of bacterial species are commonly defined using average nucleotide identity (ANI)-based thresholds [68]. However, other types of approaches are often used for the delineation of more resolute clusters of strains or lineages within species and the selection of individual genomes to serve as representatives. Specifically, phylogenetic [9], alignment [1011] and k-mer-based [1213] approaches have been used as alternatives to ANI-based methods for such applications. While these alternative approaches have demonstrated scalability to handle species featuring thousands of genomes, they typically involve multi-step workflows, for instance, needing preliminary construction of core genome sequence alignments or phylogenies. In contrast, an approach based on ANI-cutoffs is more direct and simply requires genomic assemblies as input.
To reduce computational costs and resource requirements for downstream analyses, genome dereplication can be extremely useful. Methods that have used ANI to perform dereplication of an input set of genomes – to select representatives – have largely been designed for metagenomic applications. A common strategy of these methods [1416] involves first performing preliminary clustering using faster but less precise ANI calculators [1718] to coarsely bin related genomes together. Afterwards, a secondary, more stringent, clustering is performed using a more accurate but lower efficiency approach, such as FastANI [719]. More recently, a new method, skani, was developed for high-accuracy and efficient estimation of ANI [20], thus enabling a quick and direct approach for clustering or dereplicating large sets of similar (meta-)genomes. skani’s speed and accuracy stem from an internal, preliminary k-mer sketching step to avoid computationally costly ANI estimation for distantly related genome pairs and the use of k-mer chaining for finding orthologous regions between genomes [21].
For many comparative genomics studies, achieving nucleotide resolution diversity might be less important than understanding the key gene gain and loss events that have shaped the pangenome of the focal species or genus. While the metrics of ANI and percentage of shared genes for pairs of genomes are positively associated [722], if the researcher aims to select representative genomes that adequately sample a taxon’s pangenome space, a direct approach to maximize the number of different genes sampled by the fewest number of representative genomes might be preferred to an ANI-based approach. For such methods, recent advances in heuristics for efficient clustering of large protein datasets [2325] can be leveraged.
Here, we describe skDER and CiDDER, which implement approaches for genomic dereplication based on assessments of ANI and protein cluster saturation, respectively. skDER features two modes, ‘greedy’, which is akin to greedy set cover clustering and strictly adheres to thresholds for dereplication specified by users, and ‘dynamic’, which approximates single-linkage clustering and aims to minimize the number of representative genomes selected. We apply the dereplication algorithms primarily on the diverse genus of Enterococcus and the mobile genetic element (MGE)-rich species of Enterococcus faecalis. We demonstrate their efficiency compared to currently available genomic dereplication tools primarily designed for metagenomic applications.
Methods
Genomic dereplication approaches
Two distinct approaches for genomic dereplication are provided within skDER. Both approaches are based on first estimating ANI between pairs of genomes using skani [20]. Specifically, the skani triangle mode is applied with user-customizable parameter options. The default options for running skani triangle are mostly retained as the default options in skDER, with the one exception being an increase to the value of the screening parameter, -s, from 80.0 to 90.0, which is sensible for improving computational efficiency for most use cases of skDER. In v1.3.0 and more recent versions, the value of this screening parameter is instead by default set to X%, where X is the requested ANI cutoff minus 10%.
For this manuscript, we used version 1.2.6 of skDER and CiDDER, version 0.2.2 of skani [20], version 3.5.1 of Pyrodigal [26] and version 4.8.1 of CD-HIT [27]. Visualization to showcase dereplication was performed using the program granet (version 1.2.7), which uses the igraph library [28] and is included within the skDER package.
By default, the ANI threshold in skDER was set to 99% but increased to 99.5% in v1.3.0 to reflect recently determined thresholds for distinguishing different sequence types [29]. Because CiDDER aims to select a set of representative genomes that adequately sample the pangenome of a taxon, the default value for the percentage of total distinct protein clusters that need to be accounted for is set to 90%.
skDER greedy ANI-based dereplication
The ‘greedy’ algorithm is designed to ensure that all non-representative genomes have ANI and AF values that meet the user-specified cutoffs relative to at least one representative genome. To this end, the ‘greedy’ algorithm for dereplication within skDER begins by initializing a set called representative_genomes. Pairwise genome-to-genome similarity information from skani [20] is next assessed, where for each genome, a list is constructed of alternate genomes deemed similar to it using user-adjustable ANI and AF cutoffs. A score indicating the value of a genome to serve as a representative is computed. This score is simply the connectivity of a genome, how many other genomes it is similar to, multiplied by the N50 for the genome. Genomes are sorted in descending order by their respective score. The first genome, with the greatest score, is automatically selected as a representative and appended to the representative_genomes set. The direct list of alternate genomes regarded as redundant to this genome is noted and kept track of in another set, nonrepresentative_genomes, disqualifying them from serving as representatives. As the full list of genomes is traversed linearly, they are appended to the set of representative_genomes if they have not previously been added to the set of nonrepresentative_genomes. Similar to the first representative genome, genomes subsequently selected as representatives also have their respective sets of genomes deemed as redundant to them appended to the nonrepresentative_genomes set.
skDER dynamic ANI-based dereplication
The ‘dynamic’ algorithm for dereplication within skDER begins by initializing an empty set, redundant_genomes. Pairwise genome-to-genome similarity information from skani [20] is assessed, and, similar to the greedy algorithm-based approach, a score is used to determine the preferred genome in the pair to use as a reference. The score for a genome is computed as the product of its assembly N50 and the number of genomes that are regarded as highly similar to it based on user-adjustable ANI and AF cutoffs. Assuming the genomes are similar to each other at the required ANI and AF cutoffs and the AF for one genome is not substantially larger than the other, the genome with the lower score, less ideal to serve as a representative, is added to the set of redundant_genomes. Further, smaller genomes largely contained within larger genomes should not serve as a representative. Thus, a third filter, the difference in the AF between a pair of genomes, which is non-symmetric, is also assessed during dereplication. If this difference exceeds a certain threshold, the ANI between the genomes meets the ANI cutoff, and the AF for at least one genome exceeds the AF cutoff, then the smaller genome, with the higher AF, will by default be added to the redundant_genomes. Once all pairs of genomes are assessed, genomes that are absent in the redundant_genomes set are reported as representatives. It is important to note that this approach approximates selecting a single representative genome for each coarse transitive cluster of related genomes. Therefore, some non-representative genomes might exhibit greater distances to the nearest representative genome than the requested thresholds if they are indirectly connected to the representative genome.
CiDDER protein cluster saturation-based dereplication
CiDDER performs genomic dereplication by applying methods for large-scale protein clustering to achieve some user-defined saturation of the total coding pangenome space across the full genome set amongst selected representatives. Coding sequences on the edge of scaffolds or contigs are requested not to be reported when running Pyrodigal by default. Starting in v1.3.0, users can also provide proteomes directly to CiDDER as input to avoid gene-calling and the taxonomic constraints of Pyrodigal-based gene calling. After gene-calling, predicted protein sequences are concatenated and clustered using CD-HIT [25]. The key parameters controlling the greedy-based protein clustering done by CD-HIT are user-adjustable, but by default, CiDDER sets configurations to those that are sensible for dereplication of genomes belonging to a single species, for instance, by collapsing proteins at an identity threshold of 95% amino acid identity. Namely, the identity threshold is set to 95.0% with a word size of 5 used, an alignment coverage threshold of the shorter sequence by the longer sequence of 90% and an alignment coverage of the longer sequence by the shorter sequence of 75%. After determining protein clusters, the number of unique protein clusters per genome is computed. The genome with the largest number of protein clusters is selected as the initial representative genome. Then, in an iterative process, the next representative genome is selected based on which one contributes the most additional number of unique proteins, which have not been seen in a previously selected representative genome. This addition of representative genomes continues until one of three criteria is met: (i) the next genome adds less than X number of distinct protein clusters (X is by default 0), (ii) over Y% of the total distinct protein clusters across all genomes are found in the so-far selected representative genomes (Y is by default 90%) or (iii) over Z% of the total distinct multi-genome protein clusters across all genomes are found in the so-far selected representative genomes (Z is by default 100%). Thus, while all three parameters for controlling representative genome selection are user-adjustable, by default, only Y is used for representative genome selection. An option to request incremental selection of additional representative genomes from the remaining pool of non-representative genomes that lack minimal overlap of protein clusters, 90% by default, to at least one established representative genome was introduced in v1.3.0.
Auxiliary utility functionalities present in skDER and/or CiDDER
For convenience, we also implemented several auxiliary functionalities within skDER and CiDDER. For instance, users are able to automatically download all genomes belonging to a single bacterial species or genus in the Genome Taxonomy Database (GTDB) [430]. In addition, similar to other tools for genomic dereplication, both skDER and CiDDER allow users to request a secondary clustering of non-representative genomes to representative genomes, which will also report similarity metrics for pairs of genomes in the same cluster. For CiDDER, such clustering can involve either assessing similarity based on shared protein clusters or based on ANI. For skDER, a tiered approach is taken, whereby for each non-representative genome, its best matching representative is first selected based on the largest ANI from the subset of representative genomes that meet both the ANI and AF cutoffs requested. If no representative genome is close enough to the focal non-representative genome at the requested ANI and AF cutoffs, then the representative genome with the highest ANI is selected. If there are ties, AF is used as a secondary criterion for sorting and selecting the best representative genome.
Both skDER and CiDDER also feature an option to filter genomes for predicted MGEs using either PhiSpy [31], primarily designed for phage prediction – but run more loosely by not requiring any hallmark phages – or geNomad [32]. While geNomad is robustly designed to identify both plasmids and phages, leading to more accurate removal of such large MGEs, it takes more time to run computationally. Filtering of predicted MGEs from genomes is performed using mgecut, a standalone wrapper of PhiSpy and geNomad included in the skDER suite. Importantly, because it splits scaffolds along coordinates of MGEs, mgecut is applied in skDER after computing N50 so as not to penalize genomes with many MGEs and allow them to still serve as representatives. Finally, skDER features an option to assess the number of representative genomes that would result from using a variety of ANI or AF thresholds. This can be useful if users aim to select at most some number of representative genomes and would like to learn uniform cutoffs to achieve such a selection.
Assessment of the influence of MGE filtering on ANI and AF estimates
mgecut was run using geNomad (v1.8.0; database v1.7) [32] on all 1,902 E. faecalis genomes included in GTDB release 207^4^. Pairwise ANI estimates were calculated for E. faecalis genomes with and without MGE filtering using skani triangle [20].
Benchmarking computational resource usage of different genome dereplication programmes
We compared the performance of skDER and CiDDER for genomic dereplication against dRep (v3.5.0) [14] and galah (v1.4.0) [16] through application to genomes classified as Enterococcus (including Enterococcus_A and Enterococcus_B) in GTDB release 207 [4]. We ran dRep with Mash (v2.3) [18] for primary clustering and FastANI (v1.34) [7] for secondary clustering because the configuration is one of the most frequently used for genomic dereplication in recent years. CheckM [33] was skipped in dRep due to excessive time requirements. More recently, galah was developed and replaced the usage of MASH for primary clustering of genomes into coarsely related genomic groups with a faster alternative, Dashing [17]. However, given the advantages of skani for ANI estimations and performing preliminary screening to avoid more computationally intensive ANI computation between distantly related genomes, recent versions of galah have switched to using skani for both the preliminary and final clustering steps. We, thus, ran galah (v1.4.0) [16] with the recent defaults using skani code in the backend. For benchmarking, all programmes were run using 30 threads where possible, and resource use was measured using the Unix time command. To systematically reassess the distance of non-representative genomes to representative genomes determined by different methods, we recomputed ANI and AF between all pairs of genomes using skani triangle [20] run in ‘-slow’ mode for increased accuracy. The closest matching representative genome per non-representative genome was selected based on a score equivalent to the product of the non-representative AF and ANI.
A subset of 1,902 genomes corresponding to the species E. faecalis, based on GTDB R207 classification, was gathered, annotated using Prokka (v1.13) [34] and processed through Panaroo (v1.5.0) [35] for orthology inference. Importantly, Prokka was requested not to report coding sequences on the edge of scaffolds or contigs to avoid overinflating the pangenome with partial CDS features due to assembly fragmentation [36]. To define the phylogenetic breadth of the lineage, single-copy core ortholog groups were identified, and muscle super5 (v5.1) [37] was used to construct protein alignments for them. The protein alignments were next converted to codon alignments using PAL2NAL (v14.1) [38], which were then trimmed for sites that featured gaps in greater than 10% of sites using trimAl. Trimmed codon alignments were concatenated together and provided as input to FastTree2 (v2.1.11) [39] to construct a phylogeny using a generalized time-reversible model. The ete3 (v3.1.3) [40] package was used to extract subtrees corresponding to the different sets of representative genomes selected by the different dereplication programmes, with the ‘preserve_branch_length’ option set to True, and to measure the summed branch length of the subtrees.
Results
Convenient tools to select representative genomes for microbial comparative genomic analyses
skDER and CiDDER are both programs designed to perform dereplication of large microbial datasets (Fig. 1, Text S1, available in the online Supplementary Material). First, skDER leverages advances in fast and accurate ANI estimations by skani [20] and provides two memory-efficient approaches for processing skani’s results to determine a set of genomes that are well suited to serve as representatives. The two approaches within skDER are referred to as the ‘greedy’ algorithm and the ‘dynamic’ algorithm (Fig. 2a, b). The ‘greedy’ algorithm guarantees that all non-representative genomes have ANI and AF values that exceed the user-adjustable specified cutoffs to at least one representative genome. It functions similarly to approaches for genomic dereplication in other popular software such as dRep and galah [1416]. In contrast, the ‘dynamic’ algorithm approximates single-linkage clustering and more sparingly selects a single representative per coarse clustering of genomes. The ‘dynamic’ algorithm, thus, does not guarantee all non-representative genomes are within a certain ANI and AF distance from their closest representative genome.
Overview of algorithms for genomic dereplication. Overviews for the (a) skDER ‘greedy’ dereplication, (b) skDER ‘dynamic’ dereplication and (c) CiDDER dereplication approaches are shown.
Visualization of the dereplication of 66 Cutibacterium avidum genomes using skDER and CiDDER with granet. The programs skDER and CiDDER were run to download and dereplicate 66 genomes classified as C. avidum in GTDB release 220. Afterwards, the granet tool was used to generate network visualizations for the (a) skDER greedy, (b) skDER dynamic and (c) CiDDER runs. Nodes correspond to different genomic assemblies, and edge connections represent pairs of genomes exhibiting >99 % ANI and a maximal AF >90 %. Genomes selected as representatives are shown in red.
In CiDDER, comprehensive protein clustering using CD-HIT [25] is applied to minimize the number of representative genomes needed to maximize sampling of the coding pangenome space (Fig. 2c). This is performed via an iterative approach in which genomes that contribute the most additional protein clusters, not accounted for in previously selected representative genomes, are prioritized to serve as the next representative. By default, the algorithm terminates when 90% of the total number of distinct protein clusters in the coding pangenome space are sampled. Users can adjust this cutoff and apply other criteria for controlling the selection of additional representative genomes.
Beyond dereplication, skDER and CiDDER both offer users a variety of options for convenience and additional functionality. For instance, they feature an option to automatically download genomes from National Center for Biotechnology Information’s GenBank database belonging to a single species or genus based on GTDB classifications [43041] and perform dereplication on them. They also feature options to perform secondary clustering of non-representative to representative genomes. Clustering reports feature detailed information such as the ANI and AF of the non-representative genome to its closest representative genome or the percentage of shared protein clusters between them. In addition, skDER features an option to test combinations of different ANI and AF cutoffs to assess how many representative genomes are selected when they are applied (Fig. S1). This option might be of interest to users when they aim to select at most a certain number of genomes as representatives.
Functionalities for visualizing the relationship between genomes and assessing the effects of MGEs
The skDER package also includes two standalone scripts: granet and mgecut. The granet program can be used with results from skDER or CiDDER for creating network visualizations, which summarize the dereplication process (Fig. 2). In such networks, nodes represent genomes and edges are shown between genome pairs that are similar enough to meet ANI and AF cutoffs. Nodes are coloured red if they are representative genomes and black otherwise. The mgecut program aims to filter large MGEs, namely plasmids and viruses, from genomes prior to computing AF estimates. It is a simple wrapper program that runs and processes the results of either PhiSpy [31] or geNomad [32], two programs for MGE annotation, and then cuts out MGE coordinates in the input genome. mgecut is integrated within skDER and CiDDER and can be requested to process genomes prior to ANI estimation or protein clustering.
To assess the effect of MGEs on ANI, we processed 1,902 E. faecalis genomes through mgecut using geNomad [32]. E. faecalis was selected because it is known to be rich in MGEs, which have shaped the evolution of the species [4243]. skani [20] was next used to infer ANI and AF estimates between pairs of unaltered genomes and separately between pairs of MGE-filtered genomes. ANI estimates between corresponding pairs of genomes were compared to reveal that MGEs tended to reduce ANI between genomes, with a median difference of −0.05 (Fig. 3a). Similarly, AF values were also generally less similar between pairs of genomes when MGEs were included, with a median difference of −3.43 (Fig. 3b). In addition, Rodriguez-R et al. recently reported that an intra-species ANI gap is consistently found and proposed a cutoff of at least 99.99% ANI to distinguish strains [2944]. We find that this ANI gap is still present after removal of MGEs (Fig. S2).
Assessment of MGE effects on ANI and AF between pairs of E. faecalis genomes. The difference between unprocessed and mgecut processed genomes for (a) ANI and (b) AF estimates is shown. The dashed line in red indicates the median value. (c) The relationship between ANI computed using unprocessed genomes and ANI computed using mgecut processed genomes where each point corresponds to a pair of genomes. The dashed blue line provides a one-to-one reference. Note, because ANI is symmetric while AF is not, there are twice as many genome pairs considered in panel (b) in comparison to panels (a) and (c).
Comparison of skDER to other ANI-based approaches for representative genome selection across Enterococcus
We compared the performance of skDER and CiDDER for representative genome selection across the diverse genus of Enterococcus [45] with two other ANI-based dereplication programs, dRep and galah, which are primarily designed for selecting reference genomes for mapping metagenomic readsets. The dynamic and greedy-based approaches implemented in skDER, as well as CiDDER, dRep and galah, were independently applied to a set of 5,291 low to high-quality Enterococcus genomes from GTDB [4], including metagenome assembled genomes (MAGs). FastANI [7] was requested for secondary clustering in dRep, whereas skani was used for ANI inference and screening in galah and skDER. For all ANI-based methods, dereplication was performed using an ANI threshold of 99% with either a coverage or AF cutoff of 25% or 90%.
One of the most common approaches taken for the task of genomic dereplication is running dRep with FastANI for secondary clustering. Importantly, this method was designed for application on MAGs. When applied to a dataset of MAGs from a common microbiome, the divide-and-conquer approach it implements to first partition MAGs into coarse bins and, afterwards, perform more precise ANI inference separately within each partition of genomes is extremely efficient. However, in our application of dRep on Enterococcus, only two partitions were identified – leading to extensive FastANI computations that took considerably longer than skDER and galah, which were run using skani for accurate ANI inference instead. Thus, much of the speed boost for galah and skDER is from the use of skani, which is substantially faster than FastANI [20] (Fig. 4a). Dereplication with dRep also required greater memory than galah, skDER and CiDDER; however, all methods used less than 18 GB of memory (Fig. 4b). The quality of representative genomic assemblies selected by the different methods, measured using the N50 statistic, appeared largely similar (Fig. S3).
Comparisons of skDER and CiDDER to other dereplication approaches. Bar plots show the (a) runtime (based on using 30 threads), (b) maximum memory consumption and (c) total representative genomes selected from dereplication of 5,291 Enterococcus genomes by different methods. The time and memory needed to run CheckM2 are not accounted for in the plots for two galah dereplication investigations where CheckM2 information was incorporated. (d) The proportion of non-representative genomes that meet ANI and AF criteria in relation to a nearby representative genome. Neither indicates either ANI is <99% or AF is <25%. While dereplication methods that used skani ran it with largely default settings, pairwise ANI and AF estimates used for this assessment were obtained from running skani in ‘-slow’ mode for increased accuracy.
Surprisingly, despite being run with similar thresholds for ANI and AF, the three software selected a discordant number of representative genomes (Fig. 4c). For instance, while skDER in greedy dereplication mode selected 807 representatives at the cutoffs of 99% ANI and 90% AF, galah with information on genome quality from CheckM2 and dRep selected 630 and 1,805 representative genomes at the same cutoffs, respectively. Further investigation revealed that while galah requires all non-representative genomes to exhibit an ANI equivalent to or greater than the cutoff to a representative genome, no strict enforcement is applied for the AF cutoff (Figs 4d and S4). This explains why skDER selects more representative genomes than galah, despite similar parameters and algorithms, since, in greedy dereplication mode, skDER requires both ANI and AF of non-representative genomes to meet the specified cutoffs in relation to one or more representative genomes. The substantial increase in the selected number of representative genomes by dRep in comparison to skDER and galah when using similarly stringent cutoffs for ANI and AF might be due to underlying differences between FastANI and skani in how genomic similarity metrics are estimated.
In addition, all methods for dereplication selected at least one genome representative for each Enterococcus species in GTDB [4]. Therefore, one of the key differences between the four methods was the number of representative genomes selected for highly sequenced species in the genus, in particular E. faecalis. We further investigated the coverage of the E. faecalis pangenome by representative genomes selected using skDER, CiDDER, dRep and galah. Panaroo [35] was used to group genes into ortholog groups for all selected E. faecalis genomes. The granularity of representative genome selection was similar for each dereplication method when applied to E. faecalis, as was previously observed in their application to the entire genus (Fig. 5). Because of the stringent requirement for AF cutoffs, skDER run in greedy mode selected more representative genomes than galah under similar parameter configurations but accounted for considerably more distinct ortholog groups. While the rate of distinct ortholog group discovery per representative genome selected was 39 between galah runs using 25% AF and 90% AF cutoffs, the rate was largely retained between the galah run with a 90% AF cutoff and skDER run in greedy mode with a 90% AF cutoff at 32.5. This indicates that rate saturation had not occurred and that the additional count of genomes included as representatives by skDER substantially increased coverage of the E. faecalis pangenome. In contrast, the rate of distinct ortholog groups discovered between skDER greedy run with a 90% AF cutoff and dRep run requesting a 90% AF cutoff, which selected the highest number of representative genomes, was just 6.5. CiDDER – when run with default parameters, which requests to select representatives until 90% of the total protein clusters observed across all genomes have been sampled – performs similarly to skDER run in greedy dereplication mode with cutoffs of 99% ANI and 90% AF. Further, while CiDDER sampled a slightly greater number of ortholog groups per representative genome, the skDER run covered a wider phylogenetic breadth across representative genomes, measured as the sum of lengths for relevant branches along a core genome phylogeny. These trends are expected since ANI-based dereplication should be more sensitive to variation in the core genome used for phylogeny inference.
Overview of E. faecalis representative selections by different dereplication software. Scatterplots show the (a) number of ortholog groups determined by Panaroo and (b) the phylogenetic breadth of the core genome for 1,902 E. faecalis genomes as a function of the number of representative genomes selected by different dereplication methods.
Discussion
Here, we introduced skDER, a convenient tool that leverages recent advances in fast yet accurate ANI estimation by skani to perform genomic dereplication. We also described CiDDER, an independent program that performs genomic dereplication based on the user’s desired saturation of the non-redundant coding space observed across all genomes. Both tools are primarily designed to reduce genomic datasets for downstream comparative genomic investigations.
In the last decade, ANI-based methods for genomic dereplication have primarily been used for metagenomic applications to select representative MAGs for bacterial species or strains to then align reads to and assess presence across related microbiomes [1416]. While skDER and CiDDER are designed primarily to assist with genomic selection for comparative studies of microbial species or genera, they can also be useful for metagenomic studies. skDER’s dynamic dereplication approach might be well-suited for metagenomic applications where selection of fewer representative genomes should alleviate concerns that reads belonging to the same strain get falsely partitioned across closely related genomes. In contrast, CiDDER and the greedy dereplication mode of skDER can be used to ensure representative genomes capture the breadth of a taxon’s pangenome since they are more sensitive to distinctions in auxiliary gene content between genomes. These methods could thus be particularly useful for the selection of genomes in comparative studies aiming to account for variably conserved traits that are associated with particular lineages or habitats [46]. Ultimately, whether a more comprehensive or concise selection of representative genomes is appropriate will likely depend on the downstream metagenomic analyses being performed.
A key limitation of skDER and CiDDER relative to dRep and galah for metagenomic applications is their inability to incorporate potentially available information on genomic contamination [14163347]. While the extent of contamination is an important consideration when selecting representatives amongst MAGs in particular [48], a simple solution we recommend is for users to assess results from CheckM2 [47] or other software and remove highly contaminated genomes prior to genomic dereplication. Alternatively, users should consider the use of charcoal [49] or similar tools that enable filtering genomic assemblies for specific scaffolds or contigs exhibiting indications of being contaminants prior to running skDER or CiDDER instead of disqualifying entire MAGs from serving as representatives. Contamination within genomes becomes more difficult to resolve when it involves multiple strains belonging to the same species [1447]. Ultimately, further benchmarking is needed to compare the usage of skDER and CiDDER to dRep and galah for metagenomic applications.
A recent study found that, on average, 100 bp differences between strains result in distinct ecological dynamics within microbiomes [50]. This corresponds to an ANI of 99.998% identity if we assume an average genome size of five million bp and supports proposals by Rodriguez-R et al. to use a stringent ANI cutoff of 99.99% to distinguish different strains [29]. However, some k-mer-based ANI estimation methods might exhibit difficulties in achieving such precision, especially when applied to fragmented genomes [1420]. Furthermore, because MGEs, such as plasmids, are less stably integrated within genomes, users might prefer to filter them prior to dereplication or determining strain clusters. For one species, we observed that such MGEs usually lower the ANI between strains. This is likely because they can evolve much faster than chromosomes, especially with regard to syntenic structure [5153], and can, in a nested fashion, contain smaller MGEs, such as transposons, themselves [5455]. skDER and CiDDER both have options to automatically remove such MGEs, but their application can become expensive when dereplicating thousands of genomes.
In summary, skDER and CiDDER aim to serve as convenient, open-source tools for performing genomic dereplication to simplify downstream comparative genomic investigations. Low computational resource requirements to dereplicate over five thousand Enterococcus genomes highlight how they can be used to select diverse representative genomes that can then be investigated with user-friendly tools or web applications for comparative genomic investigations [35,5662] on generally accessible infrastructure, such as laptops.
Supplementary material
10.1099/mgen.0.001438Uncited Supplementary Material 1.
10.1099/mgen.0.001438Uncited Supplementary Material 2.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Salamzade R Sk DER Zenodo 202410.5281/ZENODO.13887710 · doi ↗
- 2Grüning B Dale R Sjödin A Chapman BA Rowe J et al Bioconda: sustainable and comprehensive software distribution for the life sciences Nat Methods 20181547547610.1038/s 41592-018-0046-729967506 PMC 11070151 · doi ↗ · pubmed ↗
- 3Salamzade R Kalan L Sk DER & CIDDER - manuscript data & code Zenodo 202410.5281/ZENODO.13891800 · doi ↗
- 4Parks DH Chuvochina M Rinke C Mussig AJ Chaumeil P-A et al GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy Nucleic Acids Res 202250 D 785D 79410.1093/nar/gkab 77634520557 PMC 8728215 · doi ↗ · pubmed ↗
- 5Salamzade R Kalan L Sk DER representative genomes for select bacterial taxa Zenodo 2023
- 6Goris J Konstantinidis KT Klappenbach JA Coenye T Vandamme P et al DNA-DNA hybridization values and their relationship to whole-genome sequence similarities Int J Syst Evol Microbiol 200757819110.1099/ijs.0.64483-017220447 · doi ↗ · pubmed ↗
- 7Jain C Rodriguez-RLM Phillippy AM Konstantinidis KT Aluru S High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries Nat Commun 20189511410.1038/s 41467-018-07641-930504855 PMC 6269478 · doi ↗ · pubmed ↗
- 8Olm MR Crits-Christoph A Diamond S Lavy A Matheus Carnevali PB et al Consistent metagenome-derived metrics verify and delineate bacterial species boundaries m Systems 20205 e 00731-1910.1128/m Systems.00731-1931937678 PMC 6967389 · doi ↗ · pubmed ↗
