How to measure bacterial genome plasticity? A novel index helps gather insights on pathogens

Greta Bellinzona; Gherard Batisti Biffignandi; Matteo Brilli; Fausto Baldanti; Davide Sassera; Stefano Gaiarsa

PMC · DOI:10.1099/mgen.0.001459·August 4, 2025

How to measure bacterial genome plasticity? A novel index helps gather insights on pathogens

Greta Bellinzona, Gherard Batisti Biffignandi, Matteo Brilli, Fausto Baldanti, Davide Sassera, Stefano Gaiarsa

PDF

Open Access

TL;DR

A new index measures how quickly bacteria gain or lose genes, linking genome plasticity to the spread of dangerous pathogens.

Contribution

A novel index incorporating evolutionary distance and gene transfer events to measure bacterial genome plasticity.

Findings

01

The index reveals distinct plasticity patterns in high-risk bacterial clones.

02

The method accounts for contiguous gene transfers in single events.

03

Results suggest genome plasticity correlates with clinically significant bacterial strains.

Abstract

Genome plasticity can be defined as the capacity of a bacterial population to swiftly gain or lose genes. The time factor plays a fundamental role in the evolutionary success of microbes, particularly when considering pathogens and their tendency to gain antimicrobial resistance factors under the pressure of the extensive use of antibiotics. Multiple metrics have been proposed to provide insights into the gene content repertoire, yet they overlook the temporal component, which has a critical role in determining the adaptation and survival of a bacterial strain. In this study, we introduce a novel index that incorporates evolutionary distance to assess the rate at which bacteria exchange genes, thus fitting the definition of plasticity. Opposite to available indexes, our method also takes into account the possibility of contiguous genes being transferred together in one single event. We…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species3

Klebsiella pneumoniae Staphylococcus aureus Escherichia coli

Figures5

Click any figure to enlarge with its caption.

Fig. 1 — Strategy applied to compute the total number of gene gain/loss events considering different scenarios: (a) identical genomes, (b) single-gene indels, (c) multiple consecutive gene indels, (d) multiple non-consecutive gene indels and (e) translocations.

Fig. 2 — Robustness of FOGS to dataset characteristics. (a) The stability of the FOGS index (y-axis) was assessed by varying the number of genomes in the dataset (x-axis), ranging from 10 to 1,000. For each dataset size, genomes were randomly selected ten times, and the results are represented as boxplots. The orange points represent the actual FOGS values. (b) The influence of genome fragmentation on the accuracy of FOGS (y-axis) calculations was evaluated by selecting two random sets of 100 genomes and simulating different levels of fragmentation (x-axis). (c) The effect of dataset inhomogeneity, represented by the sd of the number of contigs (x-axis), on FOGS values (y-axis) was analysed. This highlights the relationship between variability in genome completeness and FOGS reliability.

Fig. 3 — Pointplot FOGS within each cluster identified by fastBAPS [25] in K. pneumoniae. Vertical lines represent the CIs.

Fig. 4 — Pointplot FOGS within each cluster identified by fastBAPS [25] in S. aureus. Vertical lines represent the CIs.

Fig. 5 — Pointplot FOGS within each cluster identified by fastBAPS [25] in E. coli. Vertical lines represent the CIs.

Tables1

Table 1.. Number of genomes and clusters at each stage of test datasets' construction

Species	HQ genomes	MLST*	dRep^Δ	Number of clusters^&	Number of genomes^#
K. pneumoniae	17,887	17,435	11,196	19	1,900
S. aureus	18,849	18,188	11,319	20	2,000
E. coli	42,049	40,805 (#1, Atchman)	22,099	35	3,500

Equations3

Funding2

—http://dx.doi.org/10.13039/501100003196 Ministero della Salute
—NextGenerationEU-MUR PNRR Extended Partnership initiative on Emerging Infectious Diseases

Keywords

genome plasticitygenomicsevolution

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenomics and Phylogenetic Studies · Evolution and Genetic Dynamics · Bacterial Genetics and Biotechnology

Full text

Data Summary

The authors confirm that all supporting data, code and protocols have been provided within the article or through supplementary data files.

Data Availability

The scripts used to compute FOGS are available at https://github.com/MIDIfactory/Genome-Plasticity.

Introduction

The evolutionary success of an organism hinges on its capacity to continuously adapt to changes encountered within the environment. Prokaryotes are notably characterized by the ability to acquire exogenous DNA through horizontal gene transfer (HGT) [1] and easily lose unnecessary genes [2], both representing powerful adaptation tools. The balance between gene gain and loss events constitutes a complex trade-off [3]: whilst the acquisition of genes enables the emergence of novel functions, it also may impose a metabolic burden on the organism, demanding energy and resources for the synthesis and maintenance of the acquired genes. Whether a gene is kept is determined by the equilibrium between the fitness advantages within the specific environment and the metabolic cost associated. For instance, the presence of antibiotic resistance genes introduces a trade-off whereby a resistant strain will outcompete susceptible bacteria in the presence of antibiotics, but the metabolic burden in the absence of antibiotics will disadvantage the resistant strain [4 6]. These costs can be highly variable, with some resistance genes imposing little to no fitness cost or even providing a benefit [7].

Not only is the presence of such mechanisms fundamental for the evolution of a bacterial population, but the time component plays a key role in determining their effectiveness. The capacity of a bacterial community to swiftly gain or lose genes, termed genome plasticity, is an important factor contributing to its evolutionary success.

In the era of massive genomics data, investigating genome plasticity on both large and short timescales has become possible. To do so, however, ad hoc tools must be designed. Over the years, several metrics have been proposed to assess genome plasticity and offer insights into the dynamics of genetic exchange amongst bacteria, such as Jaccard distance applied to gene content [8 9]. Jaccard distance, a dissimilarity measure widely used in machine learning and computational genomics [10], can be applied to compare gene content across different sets of genomes using the following formula:

[eqn]

where U_A_ and U_B_ are the number of genes found only in genomes A and B, respectively, and M_A_ and M_B_ are the total number of genes found in A and B, respectively; S_A,B_ are the shared genes between A and B, and N is the total number of pairs considered.

In this context, Jaccard distance allows us to compare the overall gene repertoire across different groups of genomes. Moreover, Jaccard distance applied to gene content resembles the genome fluidity formula [11]:

[eqn]

Although the difference is slight, when Jaccard distance is used to compare gene content between two genomes, it assigns lower importance to core genes than genome fluidity. This is achieved by preventing the double counting of shared gene families between the two genomes. Genome fluidity was proposed by the authors as an alternative to core and pan genome for measuring gene content diversity within a species or groups of closely related organisms [11].

Neither Jaccard distance nor genome fluidity takes into account the time component, which, as previously stated, is a key component in defining genome plasticity. As a consequence, by applying these two indices to gene content, it is not possible to distinguish between a scenario where a bacterial community gradually accumulates a large gene repertoire over an extended period and a second scenario where the population undergoes rapid gain or loss events. The latter is a distinctive feature of emerging high-risk clones in many bacterial pathogen species, one commonly associated with increased virulence, enhanced transmissibility and extensive antibiotic resistance [12 14]. Additionally, both indices consider genes as single entities, whereas HGT events often involve regions with more than one gene. Consequently, even with a time correction, the number of gene gain or loss events may be overestimated. Alternative methods such as Panstripe [15] have been developed to model gene content evolution along a dated phylogeny, offering a more nuanced approach by integrating branch lengths and accounting for the temporal structure of the data. These require a reliable, time-calibrated phylogeny, which in turn depends on the availability and accuracy of sampling dates for all isolates. In many real-world datasets, such temporal metadata are incomplete, inconsistently annotated or simply unavailable, limiting the applicability of these methods.

In this study, we aimed at providing a novel metric to assess the rate at which bacteria exchange genes. We implemented a new method to compute the number of gene gain and loss events and introduced the evolutionary distance into the calculation, which does not require a time-calibrated phylogeny. Then, we used the new index to investigate the dynamics of bacterial genome plasticity within three widespread, widely studied bacterial species: Klebsiella pneumoniae, Staphylococcus aureus and Escherichia coli.

Methods

Flux of Gene Segments

In order to compute Flux of Gene Segments (FOGS), we followed the same approach of Brilli et al. [16] by translating each genome into a graph. (1) PanTA [17] was used to classify all the proteins from each species dataset into orthology groups. (2) The gene neighbourhood network of each genome was built using the information about the location of coding genes in the annotation file previously produced by Prokka [18]. Each gene was considered adjacent to the one downstream in the genome table only if their distance was less than 1,000 bp and if they belonged to the same contig. This threshold was chosen as a practical cutoff, since genes within this distance are often functionally related or part of the same operon. Limiting the adjacency to genes on the same contig ensures that only physically neighbouring genes within a single genomic region are considered, reducing the potential for errors due to sequencing artefacts or assembly issues. We then performed the pairwise graphs (genomes) comparison as follows: (3) for each genome, the genes that are unique to the considered pair are identified (4) and compressed if consecutive, considering them as a single element. (5) The sum of the unique gene segments obtained corresponds to the number of gene gain or loss events. To this purpose, we used the connected_components function included in the SciPy Python library [19]. (6) The number of gene gain or loss events was then weighted on the SNP distance, computed as previously described. All the steps were performed using an in-house Python script.

Datasets construction

A curated dataset of high-quality genomes was collected from BV-BRC [16] (updated December 2022) independently for K. pneumoniae, S. aureus and E. coli using the makepdordb.py script of the P-DOR pipeline [17], which automatically filters for genome size and number of contigs. To further improve the quality of each dataset, we performed in silico MLST, using schemes downloaded from PubMLST [19] in December 2022. For E. coli, we chose the Achtman scheme. Since MLST is based on single-copy housekeeping genes, the absence, or the presence of more than one copy, of one of these genes is most likely owing to a poor-quality genome assembly. Only genomes that could be assigned a sequence type (ST), using an in-house Python script, were used in the next phase.

To reduce redundancy (e.g. to avoid almost identical genomes that were obtained to analyse outbreaks) from each dataset, dRep [20] was applied using the ‘dereplicate’ function (-ms 10000 pa 0.99 --SkipSecondary), also allowing to confirm the quality of the genome using its internal default pipeline.

An SNP alignment was obtained for the reduced datasets using P-DOR (-n 0) [17] and then used as input for fastBAPS [21] with default settings. After eliminating any fastBAPS-assessed clusters with less than 100 genomes, a random selection of 100 genomes was chosen from each cluster that was still present to ensure representativeness. We named each cluster with the prevalent ST; when an ST was split between clusters, we added a letter to the name.

All the genomes included in the analyses are now listed in Table S1, available in the online Supplementary Material.

SNP distances

An SNP alignment was generated on the final datasets using P-DOR (-n 0) [17] with an internal complete genome as reference for each species (E. coli: CP043539; S. aureus: LT963437; K. pneumoniae: CP006648). To reconstruct full genome alignments, we used an in-house Python script that inserted the identified SNPs into the reference genome, producing a consensus sequence for each isolate. These consensus sequences were then aligned to form a whole-genome alignment, which was used as input for Gubbins [20] (default parameters) to detect and remove putative recombinant regions.

The resulting recombination-filtered SNP alignment was then used to compute pairwise SNP distances using the snp-dists tool (https://github.com/tseemann/snp-dists).

Assessing FOGS robustness

To evaluate the effect of the number of genomes on FOGS, we randomly sampled N genomes from the K. pneumoniae BV-BRC dereplicated dataset, with N ranging from 10 to 1,000. Between N=10 and N=100, selections were made in steps of 10, whilst larger datasets (N>100) were selected in steps of 100. For each N, genomes were sampled randomly, with repetitions performed ten times to ensure robustness across replicates. This bootstrap approach allowed us to evaluate the stability of FOGS estimates across multiple iterations. Then, to evaluate the influence of genome fragmentation on FOGS, we focused on closed genomes (i.e. those assembled into a single contig) and artificially fragmented them. We selected two sets of 100 complete genomes, with no overlap. Fragmentation was simulated by splitting each genome into N pieces (from 10 to 1,000) of varying sizes, with fragment sizes being uniform within each genome. That is, for a genome of length L and a chosen N, each fragment was approximately of length L/N. This was done using an in-house Python script. This approach ensured that only the contiguity of the genome was altered, whilst the total number of genes and gene sequences remained unchanged. Finally, to investigate the effects of dataset inhomogeneity, we computed FOGS across curated 25 subsets of 100 genomes, randomly taken from the initial dereplicated dataset. These subsets were selected to vary systematically in their levels of inhomogeneity by adjusting the proportion of complete fragmented genomes within each dataset.

Resistance and virulence genes

To assess the presence of resistance genes, we utilized ResFinder [21]. Our criteria for determining gene presence required a minimum of 60% query coverage and 90% sequence identity. Similarly, we assessed the presence of virulence genes using the Virulence Finder Database (VFDB) [22]. Genes encoding proteins for the same virulence function (e.g. operons for the synthesis of siderophores) were grouped using VFDB classification. Analyses on K. pneumoniae were repeated using Kleborate [23].

To ascertain the presence and variant of fimH, a dedicated BLASTn search was performed using the fimtyper database (bitbucket.org/genomicepidemiology/fimtyper_db) as reference (100% query coverage and 100% sequence identity).

Statistical analysis

The mean distribution of the number of gene gain/loss events weighted by SNP distances (X_A,B_/d_A,B_) between fastBAPS clusters was tested using the Kruskal–Wallis test, followed by Dunn’s test for pairwise comparisons using Benjamini–Hochberg adjustment. The analyses were performed using R v4.1.3.

The CIs for FOGS values were estimated through Seaborn’s pointplot function, using the error=ci argument.

Results and discussion

A novel index: FOGS

To assess genome plasticity, here, we introduce a novel index called FOGS. This new index is designed to align with the aforementioned definition of genome plasticity, representing the rate of events of gene acquisition or loss. To achieve this, we developed a new method for calculating the number of such events and integrated the time component to express it as a rate.

To address the challenge of accounting for gene gain or loss events, we follow the most parsimonious hypothesis. Thus, when multiple neighbouring genes are gained or lost, we count a single event rather than distinct ones. This approach simplifies the reconstruction of gene flux whilst ensuring that our results remain computationally feasible, especially in the context of large-scale datasets. We must recognize that, in seeking simplification, our method may not capture the full complexity of gene dynamics, particularly in cases where more complex evolutionary events occur. Our approach to calculate the number of events is presented in Fig. 1. We represented each genome into a graph connecting consecutive genes, akin to the method employed in the computation of the Genome Organization Stability index [16]. When two genomes are compared, shared genes are excluded from the analysis, leaving only unique genes to be considered. This approach eliminates the influence of internal rearrangements, which would otherwise introduce noise into the calculation, as such rearrangements do not impact the number of gene gain or loss events relevant to the index. Exploiting tables of gene coordinates, if genes are consecutive, they will be compressed and considered as single entities. By summing the total number of these unique gene stretches, we calculate the putative number of gain or loss events in genes or sets of consecutive genes occurring between the two genomes (Fig. 1). It is important to highlight that this approach cannot distinguish whether an event represents a gain in one genome or a loss in the other. Instead, the focus is on the net result of these changes. Detailed technical explanations of the methodology are provided in the Methods section.

Strategy applied to compute the total number of gene gain/loss events considering different scenarios: (a) identical genomes, (b) single-gene indels, (c) multiple consecutive gene indels, (d) multiple non-consecutive gene indels and (e) translocations.

Once the number of putative gene gain or loss events is determined for each pairwise comparison, we then scale by dividing the calculated value by the evolutionary distance between the genomes. For measuring evolutionary distance, we chose to use the core SNP distance. This decision was guided by our focus on applying the index to closely related strains, particularly in the context of intraspecific analyses. Core SNPs are particularly useful for evolutionary purposes since they preserve a higher discriminant power compared, for instance, to single-copy orthologues [24]. Core SNPs may not be appropriate for studies involving entire species or comparisons across different species. In such cases, alternative approaches for measuring evolutionary distance, such as average nucleotide identity or phylogenomic methods based on single-copy orthologues, may be more suitable.

Based on the above, for a group of N genomes, FOGS is defined as follows:

[eqn]

where X is the number of gene gain/loss events, d is the SNP distance between genome A and genome B and N is the number of genomes considered. The higher the value of FOGS, the higher the genome plasticity.

Assessing the robustness of FOGS

To ensure the reliability and applicability of the novel genomic plasticity index, we evaluated its sensitivity to key dataset characteristics, including the number of genomes and the level of genome fragmentation. Specifically, we examined how variations in these factors influence the accuracy and robustness of FOGS, identifying potential biases or limitations that could arise under different dataset conditions. First, we tested datasets of varying sizes to determine the minimum sample requirements needed to produce consistent and meaningful results. The number of genomes in a dataset directly influences both the number of pairwise comparisons and the overall diversity captured in the analysis. To evaluate the impact of dataset size on FOGS, we examined multiple randomly selected datasets of varying sizes, ranging from small (N=10) to larger populations (N=1000). Our analysis demonstrated that the FOGS index exhibits stability when applied to medium and large datasets, with the index reaching a plateau at around N=90 genomes. Increasing the number of genomes beyond this threshold does not substantially alter the outcome (Fig. 2a). This finding highlights the critical importance of ensuring an adequate sample size to obtain robust and reproducible estimates of genomic plasticity, whilst also minimizing the influence of random variability. It also implies that smaller datasets may lead to less stable estimates, and including enough genomes in the analyses is highly recommended to avoid potential biases associated with under-sampling.

Robustness of FOGS to dataset characteristics. (a) The stability of the FOGS index (y-axis) was assessed by varying the number of genomes in the dataset (x-axis), ranging from 10 to 1,000. For each dataset size, genomes were randomly selected ten times, and the results are represented as boxplots. The orange points represent the actual FOGS values. (b) The influence of genome fragmentation on the accuracy of FOGS (y-axis) calculations was evaluated by selecting two random sets of 100 genomes and simulating different levels of fragmentation (x-axis). (c) The effect of dataset inhomogeneity, represented by the sd of the number of contigs (x-axis), on FOGS values (y-axis) was analysed. This highlights the relationship between variability in genome completeness and FOGS reliability.

We then investigated the impact of genome fragmentation, due to draft assemblies, on defining gene content and adjacency. Draft assemblies present a challenge for the FOGS index because they often consist of fragmented genomes represented by contigs, which are short, incomplete sequences of DNA that are not yet fully assembled into a complete genome. When working with contigs, the order of genes is not always clear because the contigs are isolated segments that may not be placed in their correct relative positions in the genome. This lack of information about the gene order can lead to inaccuracies in calculating gene gain or loss events, as the adjacency between genes is a key component of how the index measures genomic plasticity. As a result, gene adjacency could be misrepresented, and the FOGS index might incorrectly assess gene gain or loss, as it is unable to account for the true order of the genes. To evaluate the impact of fragmentation, we simulated varying levels of fragmentation by dividing complete genomes into smaller contigs of different lengths. In Fig. 2b, two distinct populations are observed, with FOGS values increasing as fragmentation levels rise. This indicates that FOGS calculations for these artificially fragmented datasets overestimated gene gain/loss events when compared to the complete genomes, and the higher the fragmentation, the higher the overestimation (Fig. 2b). This highlights the importance of preprocessing datasets to remove low-quality draft genomes whenever possible, to ensure more accurate and reliable FOGS calculations. In real-world applications, genomic datasets often present varying degrees of completeness and fragmentation, which can influence the reliability of FOGS. Our analysis showed that when fragmentation levels are relatively uniform within a dataset, the associated error follows a predictable linear relationship with the degree of fragmentation (Fig. 2b). This consistency allows for straightforward comparisons between groups even with highly fragmented genomes. However, in datasets with high variability, particularly those dominated by fragmented genomes, we observed a significant increase in the variance of FOGS results (Fig. 2c). To reduce these issues, we recommend avoiding comparing datasets with significantly different levels of fragmentation, since this will bias the index. With the current advancements in sequencing technology, the number of genomes and their quality are no longer the limiting factors they once were. Today, high-throughput sequencing methods and improved assembly techniques have made it possible to generate large numbers of high-quality genomes with reduced fragmentation. This significantly mitigates the impact of incomplete genomes and enhances the representativeness of datasets. As a result, large genomic populations can now be analysed with greater reliability, enabling more accurate estimations of genomic plasticity, even in studies involving diverse genomic sets.

Application of plasticity indices to large genome datasets

We applied FOGS to investigate plasticity trends within three species of clinical interest. We choose as case studies K. pneumoniae, which is a well-known nosocomial and community-acquired bacterium often associated with multidrug resistance, S. aureus, which is both nosocomial and community-acquired, and E. coli, known for its wide repertoire of ecological niches. For each organism, we identified prominent taxonomic clusters using the fastBAPS algorithm [25] from a comprehensive collection of high-quality genomes accessible through the BV-BRC database [26]. Then, we purged nearly identical genomes (i.e. outbreaks) and selected a representative subset of 100 genomes from each of the most prominent clusters (>100 genomes). For convenience of interpretation, we labelled each cluster with the most represented multi-locus sequence type (MLST or ST) amongst the genomes. See Table 1 for an overview of the results obtained at each stage of the filtering process (for technical details, see the Methods section).

We applied the newly created index FOGS to the three species datasets in order to investigate the presence of different plasticity patterns amongst the clones. Initially, we considered the presence of a relationship between the number of gene gain or loss events versus the number of SNPs for each pairwise comparison (Fig. S1) within each species. Although a general relationship was observed – indicating that a higher SNP distance corresponds to a higher number of potential gene gain/loss events – there is a noticeable clustering at low SNP distances. This clustering reveals a broad spectrum of putative events in this range. Also, the relationship observed is non-linear.

In addition, we investigate a possible association between FOGS and virulence or resistance traits by comparing the index to the mean number of resistance and virulence genes for each cluster. As follows, we discuss the results obtained for each species in depth.

K. pneumoniae

K. pneumoniae is one of the leading causes of hospital-acquired infections, especially amongst immunocompromised individuals, elderly patients and those with chronic conditions. Within the hospital environment, this bacterium can persist on various surfaces, on medical equipment and in water sources. Thus, it poses a significant challenge in terms of antibiotic resistance, making it a persistent concern for healthcare settings [27]. The K. pneumoniae population shows a remarkable level of diversity, encompassing numerous distinct phylogenetic lineages or ‘clones’ [28]. These lineages can be defined as multidrug resistant (MDR), or hypervirulent based on the prevalence of determinants of resistance to antibiotics or virulence genes.

Wyres et al. analysed the gene content diversity between MDR and hypervirulent clones and found a reduced genetic diversity in the hypervirulent ones. They achieved this result by calculating the pairwise Jaccard gene content distances amongst genomes belonging to a clone [8]. In our study, we focused on evaluating the rate at which genome content evolves. Most of the clusters in our dataset, including STs 681, 13, 218, 348, 231, 395, 65 and 29, displayed a generally low gene flow. However, a smaller subset of clusters, exclusively represented by STs 307, 101 and 15, exhibited higher values, suggesting greater plasticity (Fig. 3).

Pointplot FOGS within each cluster identified by fastBAPS [25] in K. pneumoniae. Vertical lines represent the CIs.

Clusters at higher FOGS levels (307, 101, 15 and 147) correspond to STs that are widely recognized globally as emerging high-risk clones, due to their potential to cause severe infections and their association with antimicrobial resistance (AMR), including pan resistance [12,29 31]. Notably, ST307 has been gradually displacing the well-known ST258 in multiple areas of the world (e.g. [12]). Readers should take into account that the cluster we named ‘258’ encompasses ST11, ST258 and ST512, which are part of the CG258. This lineage has been, for a long time, the most prevalent MDR-carbapenemase producer group [29 32].

Based on our analysis, cluster 307 exhibits substantially higher genome plasticity compared to cluster 258 (P<1E-10). This finding provides a possible explanation for the recent rise of ST307, which appears to be the most plastic. Moreover, considering the mean number of resistance genes per clone, ST307 stands out as one of the most resistant clusters in our dataset. A positive correlation between plasticity and antibiotic resistance can be observed (Fig. S2A; R^2^=0.43, P-value<0.05). This result is in accordance with the findings of Wyres et al. [8]. However, our results also suggest that no direct or inverse relationship is present between virulence and plasticity (R^2^=0.12, P-value>0.05) (Fig. S2B). To hypothesize an explanation for these observations, one should consider that virulence genes mostly imply a constant fitness advantage. So they are more likely to be maintained and transmitted vertically than resistance genes, which are only needed in the presence of the antimicrobial and repeatedly lost and regained (e.g. [33]). As a consequence, our results suggest that the K. pneumoniae strains with a higher plasticity are also more likely to be resistant to antimicrobials.

S. aureus

S. aureus is a versatile Gram-positive bacterium that colonizes the skin and mucous membranes of humans and animals. Whilst it is a common member of the human microbiota, S. aureus can also cause a wide range of infections, from minor skin and soft tissue infections to life-threatening diseases such as bloodstream infections, pneumonia and endocarditis [34]. As K. pneumoniae, S. aureus is known for its ability to acquire and maintain resistance to multiple antimicrobial agents, making it a significant public health concern worldwide [35]. Amongst resistant strains, methicillin-resistant S. aureus is particularly challenging due to the limited availability of alternative treatment options [36]. Genetic factors such as mec genes are responsible for this resistance [37].

ST5 and ST8 are the two major STs in S. aureus and are commonly associated with various types of infections, including those acquired in healthcare settings as well as in the community [38 39]. In our study, the fastBAPS algorithm successfully divided both ST5 and ST8 into distinct subgroups. We observed a noteworthy difference in plasticity within the subgroups of ST8, specifically the cluster 8A which exhibited significantly higher plasticity compared to the other subgroup, cluster 8B (P<1E-10) (Fig. 4). We hypothesized the presence of a highly successful and adaptable sub-strain within ST8, which exhibits the highest level of plasticity in our entire dataset. Subsequently, we investigated the presence of resistance and virulence determinants in these subgroups. Notably, cluster 8A was found to be enriched in specific resistance genes: mecA (responsible for methicillin resistance) was found in all the 8A genomes but only in 60% of the 8B group (Fig. S3); mph(C) and mrs(A) confer resistance to macrolide antibiotics, such as erythromycin, azithromycin and clarithromycin; aph(3’)-III is associated with the resistance to gentamicin, tobramycin and amikacin; ant(6)-Ia provides resistance to aminoglycoside antibiotics, including kanamycin and neomycin. Whilst the virulence pattern remained relatively stable between the two subgroups, we discovered a difference in the presence of chemotaxis inhibitory protein of S. aureus genes, which were found in the majority of genomes within cluster 8A (91 out of 100 genomes) compared to 8B (33 out of 100). This observation warrants further investigation in additional datasets to confirm its consistency and potential biological relevance. These genes play a crucial role as important virulence factors, helping S. aureus evade the innate immune defence systems [40]. These findings underscore the significance of cluster 8A as a particular plastic subgroup within the ST8. On the other hand, the two subgroups of ST5 did not show a significant difference in plasticity, resistance or virulence patterns.

Pointplot FOGS within each cluster identified by fastBAPS [25] in S. aureus. Vertical lines represent the CIs.

E. coli

E. coli is a widely studied Gram-negative bacterium that plays a crucial role in various ecological niches and has significant implications for human health. Previous studies have demonstrated the presence of extensive genetic variation within E. coli, enhanced by its ability to adapt to a range of diverse niches including hospitals, animal reservoirs and natural ecosystems [41 43]. In the clinical setting, E. coli represents a significant public health concern due to its ability to cause a wide range of infections, ranging from urinary tract infections (UTIs) to more severe bloodstream infections. We generally observed wider CIs compared to the other two species, likely reflecting the greater genetic diversity and population structure within E. coli.

ST131 is one of the most predominant and globally disseminated lineages associated with UTIs and bloodstream infections. Frequently associated with multidrug resistance, including extended-spectrum beta-lactamases and fluoroquinolone resistance, it can be considered the most successful MDR clone of all time [44 45]. In our dataset, ST131 was split into two clusters, namely, ‘131A’ and ‘131B’. Cluster ‘131B’ showed the highest FOGS value of the dataset. This was significantly higher in respect to that of cluster ‘131A’ (P<0.01; Fig. 5). ST131’s population structure has been previously investigated, and three genetically distinct clades have been identified (A, B and C), each characterized by different fimbrial adhesin (fimH) gene variants [46]. Our analysis revealed that cluster 131A uniquely contains genomes encoding the fimH41 variant, which is exclusive to the globally recognized cluster A. In contrast, cluster 131B encompasses various fimH variants, including 30 and 40, thereby comprising the global clusters B and C. Within specific subclades of the global cluster C, a convergence of extensive resistance and virulence profiles has been observed [47]. Biggel et al. propose that this convergence may not be applicable to other E. coli lineages, such as ST73 and ST95. Despite being pandemic, these lineages exhibit low antibiotic resistance, potentially attributed to their gene acquisition capabilities [47]. Our findings align with this hypothesis, revealing a lower degree of genomic plasticity in ST73 and ST95.

Pointplot FOGS within each cluster identified by fastBAPS [25] in E. coli. Vertical lines represent the CIs.

Interestingly, cluster 1193 appears to be the second most plastic clone according to FOGS. E. coli ST1193 is currently emerging rapidly across the globe, mimicking the very successful ST131 [48].

Conclusions

Gene gain and loss events represent a complex trade-off for prokaryotes, shaping their adaptive strategies in response to environmental challenges [3]. Several methods have been proposed to explore gene content evolution, including metrics such as Jaccard distance [8 9] and genome fluidity [11]. More recently, time-aware frameworks like Panstripe [15] have been developed to explicitly model gene gain and loss along dated phylogenies. Incorporating a temporal component is crucial, as it allows for distinguishing between short-term variability and long-term evolutionary trends, providing a deeper understanding of genome dynamics. However, such approaches depend on the availability of accurate sampling dates and well-resolved phylogenies – requirements that are not always met in many real-world datasets.

Here, we introduced a novel index, FOGS, designed to quantify genome plasticity by assessing the rate of gene acquisition or loss events across multiple genomes. A key aspect of FOGS is the inclusion of a scaling factor, SNP distances, that enables gene flux to be interpreted in relation to evolutionary divergence. By default, gene gain and loss events are scaled using core SNP distances, which offer high resolution and strong discriminatory power in intraspecific comparisons. Although this approach does not reflect calendar time, in the case studies presented here, SNP-based scaling can be considered a reasonable proxy for evolutionary time. Importantly, the method is flexible and can incorporate alternative measures of evolutionary distance, depending on the dataset and the divergence level being investigated. FOGS works with datasets containing either complete genomes or draft assemblies. The input data should ideally include high-quality gene annotations and clear gene coordinates, as FOGS relies on accurate identification of gene content and their adjacency within the genome. Additionally, datasets with varying levels of genome completeness and quality can be used, but it is recommended that caution be taken when dealing with heavily fragmented genomes, as they may introduce inaccuracies in the calculation of gene events.

We applied FOGS to analyse genome plasticity trends across three bacterial species: K. pneumoniae, S. aureus and E. coli. Our results highlighted distinct plasticity patterns within specific STs and clusters, with notable connections to globally recognized high-risk clones. These findings suggest that FOGS can effectively capture variations in genome plasticity, offering insights into the evolution and adaptation of bacterial species, as well as their potential implications for public health and the emergence of AMR.

FOGS offers a scalable approach to quantifying genomic plasticity across bacterial strains and lineages. Whilst its current application focuses on intraspecific comparisons, the method is flexible and can be adapted to different levels of evolutionary divergence.

If validated further – e.g. through application to longitudinal datasets or retrospective outbreaks with known clinical outcomes – FOGS could potentially allow for the early identification of strains of potential clinical concern (i.e. with a high rate of genomic plasticity), enabling more timely and targeted interventions to control the spread of AMR and prevent outbreaks of pathogenic strains.

Ultimately, the introduction of FOGS represents a powerful tool for the study of genomic plasticity, providing a quantitative measure that can be applied to a wide range of bacterial species. By incorporating the time factor, FOGS not only measures genetic events but also frames them within an evolutionary context, offering a more comprehensive view of how bacteria adapt and evolve over time.

Supplementary material

10.1099/mgen.0.001459Uncited Supplementary Material 1.

10.1099/mgen.0.001459Uncited Table S1.

Bibliography48

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Ochman H Lawrence JG Groisman EA Lateral gene transfer and the nature of bacterial innovation Nature 200040529930410.1038/3501250010830951 · doi ↗ · pubmed ↗
2Morris JJ Lenski RE Zinser ER The black queen hypothesis: evolution of dependencies through adaptive gene loss m Bio 2012310.1128/m Bio.00036-12PMC 331570322448042 · doi ↗ · pubmed ↗
3Brockhurst MA Harrison E Hall JPJ Richards T Mc Nally A et al The ecology and evolution of pangenomes Curr Biol 201929 R 1094 R 110310.1016/j.cub.2019.08.01231639358 · doi ↗ · pubmed ↗
4Rajer F Sandegren L The role of antibiotic resistance genes in the fitness cost of multiresistance plasmidsm Bio 202213 e 035522110.1128/mbio.03552-2135038907 PMC 8764527 · doi ↗ · pubmed ↗
5Andersson DI Hughes D Selection and transmission of antibiotic-resistant bacteria Microbiol Spectr 20175 Epub ahead of print July 201710.1128/microbiolspec.mtbp-0013-2016 PMC 1168753528752817 · doi ↗ · pubmed ↗
6Andersson DI Levin BR The biological cost of antibiotic resistance Curr Opin Microbiol 1999248949310.1016/s 1369-5274(99)00005-310508723 · doi ↗ · pubmed ↗
7Sela I Wolf YI Koonin EV Selection and genome plasticity as the key factors in the evolution of bacteria Phys Rev X 2019910.1103/Phys Rev X.9.031018 · doi ↗
8Wyres KL Wick RR Judd LM Froumine R Tokolyi A et al Distinct evolutionary dynamics of horizontal gene transfer in drug resistant and virulent clones of Klebsiella pneumoniae P Lo S Genet 201915 e 100811410.1371/journal.pgen.100811430986243 PMC 6483277 · doi ↗ · pubmed ↗