Extensive mobilome dynamics in a widespread endosymbiont: long read metagenomics reveal dimeric plasmids and highly fragmented prophages in Wolbachia from Culex pipiens
Alice Brunner, Maxime Mahout, Julien Amoros, Massilva Rahmoun, Mateo Jarry, Seth R. Bordenstein, Sarah R. Bordenstein, Blandine Trouche, Julie Reveillaud

TL;DR
This study uses long-read sequencing to reveal new insights into the complex mobile genetic elements in Wolbachia bacteria found in Culex pipiens mosquitoes.
Contribution
The first near-complete Wolbachia genome with dimeric plasmid and fragmented prophage structures is assembled using long-read sequencing.
Findings
A dimeric form of the pWCP plasmid was assembled, indicating it is a replicating and active molecule.
Prophage WO regions showed extensive fragmentation and structural complexity despite long-read sequencing.
Multiple alternative gene arrangements in WO regions suggest heterogeneous prophage architectures and diversity in wPip strains.
Abstract
The obligate, intracellular bacteria Wolbachia have gained increasing interest due to their selfish modifications of host arthropod reproduction, impacts on host evolution, and utility in vector control efforts to reduce arbovirus transmission. Despite their highly reduced genomes, Wolbachia harbor a rich global mobilome that includes phages and plasmids in mosquito vectors. However, these mobile genetic elements are structurally complex, and standard genome assemblies often fail to resolve their organization and their functional relationships, leaving gaps in our understanding of how they evolve, mobilize, and influence host genomes. Here, we present the first near-complete genome of Wolbachia and its mobile elements from the vector Culex pipiens molestus in Montpellier (France), reconstructed from Oxford Nanopore long read sequencing of single female ovaries without prior DNA…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsInsect symbiosis and bacterial influences · Invertebrate Immune Response Mechanisms · Bacteriophages and microbial interactions
Background
Wolbachia is a genus of obligate, intracellular α-proteobacteria (Rickettsiales) that occur in an estimated half of all arthropod species [1], including numerous vectors of infectious diseases, as well as most pathogenic filarial nematodes [2]. It is the most common bacterial endosymbiont in animals [3, 4]. Primarily transmitted through the maternal germline, Wolbachia are renowned for their remarkable ability to manipulate host reproduction in multiple ways, including feminization, parthenogenesis, male killing and the most common — cytoplasmic incompatibility (CI) — a selfish phenomenon that causes embryonic death in crosses between infected males and uninfected females, or between individuals carrying distinct, incompatible Wolbachia strains [3, 5]. In addition to these reproductive effects, Wolbachia modulate host susceptibility to pathogens, a feature that has sparked intense interest in their use as a biological tool to control mosquito-borne diseases such as dengue, chikungunya, or West Nile virus [6]. While vertical, transovarial transmission from mothers to offspring remains the dominant mode of inheritance, phylogenetic incongruities between host and symbiont lineages indicate that horizontal transfer events occurred during evolution, contributing to the global spread of Wolbachia [7, 8].
A distinguishing driver of Wolbachia evolution and genome plasticity is its mobilome, the set of mobile genetic elements (MGEs) present in its genome or cytoplasm. These elements include phages, plasmids, and small MGEs such as insertion sequences and transposons [7, 9–12]. They can occur in high frequency within some Wolbachia, constituting up to 38% of their genomes [8, 13]. Among these elements, bacteriophage WO is the best studied. It commonly occurs as an intact or relic prophage in several Wolbachia supergroups, including A, B, E, F, and S [14]. WO typically carries a conserved set of core genes involved in functions such as capsid and tail morphogenesis, lysis, and lysogeny [12, 15]. In addition, most WO phages harbor a unique eukaryotic association module (EAM), an accessory region which encodes some proteins with domains from eukaryotic origin, implicating these modules in varied functions of the Wolbachia – insect symbiosis [12, 16]. These modules often include large numbers of ankyrin repeat proteins that are highly duplicated and potentially involved in mediating Wolbachia - host interactions [16–18]. Indeed, EAM regions harbor genes underpinning reproductive manipulations, including wmk and oscar in male killing, pifA- pifB in parthenogenesis, and cifA-cifB in CI [19–22]. The latter code for proteins with varying domain and architectures classified into five [5] and tentatively up to ten phylogenetic Types [23, 24], with at least type I, II, and IV demonstrated as CI effectors [20, 21, 25].
Less studied and recently discovered, plasmids are another component of the Wolbachia mobilome. Initially detected in Wolbachia from Culex pipiens [9] and subsequently in Aedes albopictus [10], these extrachromosomal elements remain largely unexplored. In Culex pipiens, the plasmid pWCP is highly conserved across distant populations and localities worldwide [26], suggesting a recent spread and/or strong purifying selection. A fairly stable copy number per Wolbachia cell also suggests there may be functional consequences with the bacterium during most developmental stages of the mosquito host [27]. However, it is still considered putative due to the absence of a clearly defined origin of replication, and our understanding of the biology and inheritance of the element is still in its early stages.
Recent large-scale genome studies shed light on the sequence diversity of Wolbachia [28]. Within insect hosts, Culex mosquitoes stand out since more than 99% of species in this genus are naturally infected [29], and lineages of the Culex pipiens complex have long served as a reference system for investigating CI and vector control [30–33]. This genus is notable for the diversification of its Wolbachia strains into five wPip phylogenetic groups, named wPip-I to -V, which are directly correlated with CI cross compatibility [29, 34]. To date, only three draft genomes are surprisingly available for Wolbachia infecting Culex mosquitoes: wPipPel (from Culex quinquefasciatus Pel) [35], wPipJHB (from Culex quinquefasciatus JHB) [36], and wPipMol (from Culex pipiens molestus) [37]. The most studied and conventionally used reference, wPipPel, was assembled from pools of embryos, but it is reported as a chromosome with assembly gaps, likely due to complex prophage regions and potential phage activity [35]. Subsequent analyses revealed that key loci such as the cifA-cifB (phylogenetic Type I) are present in multiple copies in individual Culex pipiens samples [32], suggesting that the single copy reported in wPipPel may result from assembly collapse in repeated regions [32]. The second genome, wPipJHB, displayed major structural differences compared to wPipPel, with large rearrangements and some highly similar contigs discarded, potentially masking strain-level variation [36]. The third genome, wPipMol, remained fragmented despite using a different adult sample source [37]. Therefore, while the available wPip genomes in Culex spp. are useful references, they remain consensus, fragmented assemblies that limit investigation of the fine-scale architecture, function, and genomic variation of Wolbachia, as well as the breadth and dynamics of its MGE, which are likely underestimated. Short read sequencing of single ovaries from four Culex pipiens specimen highlighted high phage variation between individuals together with the presence of fragmented contigs belonging to a novel MGE [9]. The fragmented nature of the four Wolbachia Metagenome-Assembled Genomes (MAGs) was not sufficient to resolve key genomic regions, and long read sequencing was necessary to confirm the whole circular pWCP plasmid element. Overall, the presence of numerous repeated regions in Wolbachia genomes in Culex lead to fragmented assemblies and achieving a closed and circular Wolbachia reference genome remains challenging [9, 35, 38, 39].
Here, we reconstruct and analyze a new Wolbachia genome obtained from long read sequencing of individual ovaries from a gravid Culex pipiens molestus specimen. We developed an in-house protocol leveraging the sequencing capacity of MinION technology (ONT) without any upstream nucleic acid amplification to prevent the introduction of possible PCR errors and chimeric sequences between highly repeated genomic regions. In addition, we integrated independent Illumina-sequenced ovary metagenomes to verify specific punctual mutations. A fine scale analysis of raw long reads and coverage data showed a higher number and different organization of WO phage modules compared to previously assembled wPip genomes, and two forms of the plasmids, including a dimeric one, providing the first evidence that pWCP is a self-replicative element and part of a highly dynamic mobilome.
Results
Ovaries with high Wolbachia abundance selected for genome sequencing
To maximize the likelihood of assembling a complete Wolbachia genome, we first evaluated the bacterial load in gravid ovaries from five females of the Culex pipiens molestus Celestin strain using quantitative PCR (qPCR). The average Wolbachia abundance, estimated by normalizing wsp to the host reference gene ace2, was 43.1 copies per host genome equivalent—indicating a high bacterial abundance in ovaries suitable for long read sequencing. qPCR also provided insights into the relative copy number of Wolbachia-associated mobile elements. The plasmid gene GP11 showed an average of 1.7 copies per Wolbachia, suggesting nearly two plasmid copies per bacterial genome and the phage marker gpw, representative of the WO connector/baseplate module, exhibited an average of 9.5 copies per Wolbachia (Supplementary Table 3).
Fluorescent in situ hybridization (FISH) using Wolbachia-specific 16S rRNA probes was then performed on ovarian tissues from the same strain to complement molecular estimates. Observations revealed a strong and widespread bacterial signal across the ovary, with Wolbachia clearly visible around oocyte nuclei and within follicular cells (Fig. 1).
Overall, data confirmed both the high bacterial density and the homogeneous spatial organization of intracellular Wolbachia in the ovaries of Culex pipiens molestus, supporting the relevance of this tissue for genome reconstruction.
Wolbachia are located in the ovarian follicles (A-F). Colored dashed boxes (A to C) indicate area of magnification (x2.6) for images below (D to F). Wolbachia is imaged with an Atto 488-labeled probe targeting the Wolbachia 16S rRNA gene (orange, B and E), and DNA is stained with DAPI (blue, A and D).
Nanopore sequencing of a single ovary sample reconstructs the w PipCelG genome
Nanopore sequencing of dissected ovaries from a gravid Culex female generated a total of 309,559 reads (median read length: 1,223 bp; maximum read length: 119,034 bp). Alignment of raw sequences to the spiked in lambda phage genome showed a mean read accuracy of 94.95%, corresponding to an estimated error rate of ~ 5%, consistent with expectations for high quality ONT libraries [40]. The initial assembly produced a Wolbachia metagenome-assembled genome (MAG, recruiting 26,102 reads representing 8.43% of total reads) composed of 24 contigs with 91.5% completeness and 1.4% redundancy, estimated from the bacterial single copy core genes collection of Campbell et al. [41], consistent with Wolbachia endosymbiont genomes [9, 42]. Comparative pangenome analysis with three reference wPip genomes confirmed high sequence similarity across aligned regions, with the newly assembled genome wPipCelG sharing 99.84% identity with wPipPel (group I of wPip), 99.83% with wPipJHB (group I) and 99.63% with wPipMol (group IV). Phylogenetic analyses based on six diagnostic loci further assigned wPipCelG to wPip group I, consistent with its close relationship to both wPipPel and wPipJHB (see Supplementary Fig. 1). The resulting pangenome, comprising 1,166 gene clusters based on BLAST homology, revealed that wPipCelG shares 1,056 gene clusters with the three reference genomes (conserved wPip core). Among accessory genes, 10 gene clusters were unique to wPipCelG, while 8 gene clusters present in all three references were absent, suggesting both specific acquisition and potential gene loss (Supplementary Fig. 2 and Supplementary Table 4).
As described in the Method Section, a manual refinement scaffolding step was applied to improve assembly quality by removing zero-coverage and redundant contigs. Read mapping and pairwise BLASTn comparisons revealed that ten small contigs either exhibited negligible coverage or showed 100% sequence identity to internal regions of larger contigs. These were considered likely assembly artefacts or redundant fragments and were excluded from the final genome assembly. Among the 15 remaining contigs, contig_2 (1,186,930 bp) displayed strong homology to the Wolbachia core genome of wPipPel, while contig_9 (18,485 bp) corresponded to a dimeric form of pWCP plasmid. The remaining 13 contigs appeared to include WO prophage regions based on gene content.
pWCP plasmid found in monomeric and dimeric form
The reconstructed plasmid of contig_9 measured 18,485 bp, approximately twice the size of the monomeric pWCP reference (9,228 bp; [9]), suggesting a dimeric plasmid configuration. Alignment to the reference sequence revealed a high degree of structural conservation (99.77% identity), with a 16-nucleotide insertion in the VNTR region corresponding to the addition of a tandem repeat (increasing from six to seven copies) (Supplementary Fig. 3). A few additional differences were observed, including three insertions, one deletion, and a C to T substitution upstream of the VNTR but overall, the assembled plasmid (contig_9) was highly similar to the reference plasmid pWCP, consistent with the strong conservation previously reported [26].
Mapping of Illumina short reads from the three additional pairs of ovaries (Culex pipiens molestus, Celestin strain) to the wPipCelG assembly indicated that the four single nucleotide insertion/deletion mutations observed in the dimeric plasmid contig were in fact attributable to basecalling errors specific to Nanopore sequencing, particularly in the homopolymeric regions. These errors affected the accurate counting of repeated identical bases, leading to false insertions or deletions in the assembled sequence. A single base substitution (C instead of T) detected within a variable number tandem repeat (VNTR) region appeared to result from the presence of one additional repeat unit in the assembly, which likely introduced an alignment ambiguity rather than reflecting a true point mutation.
To investigate long reads matching accurately to the plasmid independently of any assembly, we selected all reads mapping to contig_9 of over 9,000 bp, giving us a total of 24 high-quality long reads (with a MAPQ = 60). Alignment of these long reads to the pWCP reference yielded a read accuracy estimation of 94.53%, consistent with the general error rate measured from reads mapped to the lambda genome. The 24 long reads were individually aligned to distinguish between monomeric and dimeric forms. Among them, four raw reads (> 14 kb) supported the hypothesis of a dimeric plasmid, while 20 reads (< 11kb) more likely supported the canonical 9,228 bp monomeric plasmid form. We represented these four reads (Fig. 2A) and the seven longest reads (Fig. 2B) supporting a ca. 18kb and 9kb plasmid, respectively, in Fig. 2. Notably, two long reads, read 1 and read 2 measuring 22,496 bp and 19,413 bp, respectively, covered more than twice the length of the monomeric plasmid, highlighting the presence of a full-length dimeric plasmid molecule independent of possible assembly artefacts (Fig. 2A). In addition, read 3 (18,092 bp) aligned contiguously to the reference sequence for 14,123 bp but the remaining region matched pWCP with gaps (not highlighted on Fig. 2A).
Evidence supporting a replicative plasmid
We examined the start and end positions of the 24 plasmid reads and found that most started and ended at apparently random positions along the plasmid. However, two long reads of nearly identical length but opposite orientation (Read 5: 10,328 bp and read 6: 10,322 bp) both started and ended at the same positions (Fig. 2B). This observation could suggest that these reads originated from the forward and reverse strands of a single DNA molecule undergoing replication while adopting a coiled conformation. The presence of overlapping regions within these reads also aligns with a double-stranded plasmid possibly replicating via a theta-like mechanism. A scheme for how these sequences may arise is provided in Supplementary Fig. 4B. In addition, another long read (Read 7: 10,204 bp) exhibited a complex structure, beginning with a ~ 400 bp region in antisense orientation, followed by the same sequence in sense orientation, then continuing through the rest of the plasmid and ending with an additional copy of the initial region in sense. This arrangement also supports the possibility of replication associated structural intermediates, and a schematic hypothesis of such a configuration is presented in Supplementary Fig. 4C.
Independently, we analyzed the reference pWCP sequence using OriV-Finder [43] to further explore potential replication features that could explain such multimeric configurations. This analysis identified a type II origin of replication (oriV) located between positions 6,363 and 6,513 bp (Supplementary Fig. 5). This region overlapped with the previously annotated VNTR composed of 16-bp tandem repeats and displayed features consistent with iteron-based replication initiation, a mechanism previously described in association with multimeric plasmid configurations.
Structural organization of the Wolbachia and prophage WO genomes
The Wolbachia core genome reconstructed from contig_2 (1,186,930 bp) exhibited an average coverage of approximately 30X. The bacterial origin of replication was identified between positions 426,616 and 427,120 bp in this contig. In addition to this single large contig, phage-rich regions were recovered as multiple smaller contigs (12 contigs), reflecting the challenge of assembling highly repetitive prophage sequences (Supplementary Fig. 6A). This observed fragmentation and incomplete recovery of Wolbachia genomes despite the successful long-read sequencing of a single individual without prior amplification highlighted the structural complexity of phage loci. More specifically, the unassembled prophage regions appear to be distant from the predicted bacterial origin of replication (oriC) [44]. Consistent with this pattern, phylogenetic analysis of the large serine recombinase places the wPipCelG WO element within the sr3WO clade as defined by Bordenstein & Bordenstein (2022) [12] (Supplementary Fig. 7), a group for which distal phage integration relative to oriC has been reported. Within contig_2, phage-related genes were detected at both ends of the contig. Among them, a cifB gene from phylogenetic Type IX [23] was identified at positions 6,131 to 8,827 (corresponding to WP0462 in wPipPel), along with several other phage associated genes, including two head structure genes located at positions 1,154,024 to 1,155,632. Additionally, a partial sequence of an Octomom-like region was identified towards the terminal end of contig_2 (from 1,167,601 to 1,186,910 bp). This fragment contained ‘toxicity-associated ‘domains, predicted eukaryotic-like toxin, including WP1346 and WP1348 together with a partial WP1349 from wPipPel interspersed with transposase sequences (WP1345 and WP1347), a genomic organization similar to what has been described for the Octomom-like region in wPipPel [12].
Phage WO rearrangements
Following manual curation, the 13 contigs containing prophage WO related regions were further analyzed, and the gene synteny, functional annotation, and sequencing coverage of the seven longest (from 7964 to 112962 bp) are illustrated in Supplementary Fig. 6.B. We observed core genes flanking EAM regions in almost all contigs, in agreement with wPipPel reference genome [12, 35]. Most contigs exhibited non-uniform coverage, most likely due to repeated genes recruiting reads in a non-specific manner. This confirms that for a large fraction of WO regions, the assembly is unreliable and highlights the intrinsic difficulty of assembling these repeat-rich loci. A minority of contigs (e.g., contig 3, contig 22) showed near-uniform coverage.
We next focused on the relative abundance of representative genes from the main structural modules of WO phage, namely, the head, connector/baseplate, tail, and tail fiber components. For each category, we calculated the average sequencing coverage (Fig. 3), which served as a proxy for estimating copy number. Head genes showed coverage levels consistent with approximately 6 to 7 copies, connector/baseplate genes with 6 to 7 copies, and tail and tail fiber genes with 3 to 5 copies. These values suggest that the genome contains multiple intact and relic prophage elements, the latter of which have experienced gene loss and/or rearrangements in one Wolbachia. The higher abundance of head and baseplate genes relative to the rest of the phage WO genome is a common feature of prophage WO [45], and the intact prophage variants are compatible with ongoing constitutive phage activity. In addition to structural components, we also assessed the copy number of phage-associated genes involved in host manipulation and proliferation, including cifA-cifB and the Octomom-like region. On average, we estimated four copies of the Type I cif genes per Wolbachia genome, with both genes located on contig_10. The Type IV cif genes were detected as single copies, also located on contig_10. We counted a single copy of the Octomom-like region identified at the terminal end of contig_2, in agreement with assembly data. Details of the read coverage and the estimated copy number of each WO phage gene are provided in Supplementary Fig. 8.
Estimated copy numbers (indicated above each boxplot) were inferred by linear regression, based on the relationship between assembled copy numbers in gene cluster and coverage in wPipCelG. For phage structural and replicative modules, the value shown represents the average estimated number of gene copies per module.
To further investigate these prophage-rich regions independently of any assembly, we examined gene syntenies directly from annotated raw long reads (≥ 10 kb). First, we noted several independent long reads (6 reads > 10kb) carrying annotations from regions assigned to WOPip1 and WOPip5 in the wPipPel reference on the same molecule, supporting that WOPip1 and WOPip5 are likely adjacent and/or rearranged in wPipCelG (Fig. 4.A). By contrast, these prophages are non-contiguous in wPipPel [12].
Within the same region, we also noted a transposase (WP1350) that is absent from the wPipPel reference, highlighting mobility and/or local amplification of transposases within Wolbachia.
Second, we identified three configurations of Type I cifA-cifB region (Fig. 4.B): two cifA-cifB palindromes with truncated cifB (supported by 8 and 7 reads (> 10kb), respectively) and one full-length cifA-cifB tandem (4 > 10kb reads). In total, these represent five instances of the cifA–cifB pair. We then observed two configurations for cifA-cifB Type IV (Fig. 4.C), including the classical tandem configuration present in wPipPel (supported by eight reads) and a palindrome cifA–cifB-cifB-cifA supported by a single ≥ 10 kb read. These findings suggest rapid evolution of these host manipulation genes.
Separately, we attempted generating a “cif synteny long read gene-adjacency graph” (Supplementary Fig. 9). Starting from the cifA Type I gene (WP0282 in wPipPel), we enumerated all observed transitions to the next gene along individual reads and annotated these transitions in each read). This analysis revealed multiple gene-order configurations emanating from cifA, in agreement with the non-uniform coverage and contig fragmentation observed in WO regions.
Discussion
We performed long read sequencing on gravid ovaries from a single female mosquito individual, without prior DNA amplification, providing an optimized and unique experimental framework to explore the genomic architecture of Wolbachia. This strategy which limited amplification biases and avoided the mixing of potential variants across individuals or tissues [42], enabled the direct capture of genomic elements that are typically difficult to access, including rare or less frequent forms of the mobilome. Additional scrutiny of both raw long read and short read datasets, obtained from independent samples, allowed us to curate the assembly and highlight a set of active MGE in the Wolbachia endosymbiont from Culex.
Plasmid structure and replication
The assembly of a putative dimeric plasmid, supported by several long reads spanning two or more copies of the plasmid, suggests that this element may be actively replicating and may naturally exist in multimeric forms, dimers and monomers, within the cell. Two main hypotheses can be proposed to explain this observation. On one hand, these multimeric forms may result from plasmids undergoing replication at the time of DNA extraction. In this context, specific replication dynamics might be occurring at this particular developmental stage (gravid ovaries), where increased replication activity could favor the capture of intermediate forms during sequencing. Notably, both major plasmid replication mechanisms, theta-replication and sigma-replication, are known to generate concatemers, especially when termination of replication fails [46–48]. On the other hand, these forms may originate from unresolved recombination events [49, 50]. In many low-copy, theta-replicating plasmids, specific resolution systems (such as res or cer sites) are responsible for efficient multimer separation [49, 51]. Although no such site was detected in the analyzed sequence, their apparent absence does not exclude the possibility of a non-canonical resolution mechanism. Without such a system, persistent multimeric forms could impair proper plasmid segregation during cell division and represent a burden for the host cell [52]. Nevertheless, some studies suggest that these forms may not necessarily destabilize the plasmid but could impose a significant energetic cost due to unregulated amplification [53].
Sequence analysis of the plasmid revealed a putative origin of replication (oriV) located within a region previously annotated as a Variable Number Tandem Repeat (VNTR), composed of 16-nucleotide tandem repeats. The exact position of this origin within the repeated region, along with the conservation and structure of the motifs, supports a replication control mechanism involving iterons, as described in theta-replicating plasmids [54]. Iterons are short repeated sequences that serve as binding sites for replication initiator proteins, facilitating local DNA unwinding and regulating plasmid copy number [54]. Although no rep gene was identified within the plasmid itself, the bacterial host encodes a repA prophage gene, which could potentially participate in initiating replication. In addition, the presence of a parA-like gene upstream of this VNTR region suggests an active partitioning system, which is consistent with the behavior of low-copy-number plasmids [55]. Taken together, this genomic context, along with the detection of long reads slightly larger than the monomeric plasmid and exhibiting overlapping region, supports the hypothesis of a theta-type replication mode, which has not been previously proposed for this plasmid.
Assembly limitations and underestimation of WO phage diversity
Our reconstruction and analysis of a novel Wolbachia genome wPipCelG from Culex pipiens molestus strain at the organ level highlighted persistent challenges in assembling prophage regions, even under optimized experimental conditions involving long read sequencing from a single reproductive organ without prior DNA amplification. Twelve contigs containing phage genes were reconstructed, reflecting the high fragmentation of this genome. This likely resulted from the highly repetitive nature of prophage regions and possibly the presence of active phage particles [12, 35]. Coverage profiling across WO regions further revealed a discrepancy between the expected copy number and the number of WO segments recovered by assembly, consistent with the underrepresentation of prophage diversity and/or free phage WO at a constitutive level. This pattern accords with prior observations [32] of mismatches between cif copy numbers estimated by qPCR and those annotated in the wPipPel reference, indicating that assemblies alone can be incomplete for multicopy loci. In our dataset, the overall genomic structure of wPipCelG and wPipPel (both wPip-I) is broadly similar, yet coverage in wPipCelG supports ~ 4 copies of the Type I cifA-cifB pair (consistent with the 4–6 range reported for Culex pipiens), despite single-copy annotations in both assembled genomes. Taken together, these observations illustrate how copy number estimation based solely on genome assemblies may misrepresent the true genomic landscape, particularly for multicopy genes embedded within repetitive regions. In line with this, our read-level inspection revealed multiple alternative syntenies around cif loci (Types I and IV), and the number of distinct arrangements does not fully track the coverage-based copy number estimates. Recent long read studies across wPip groups have likewise reported distinct syntenies around cif genes [56]. In the wPip-I (Tunis strain), three different cif adjacent environments were documented and termed cif-Type I genomic islands. Consistent with these findings, inspection of the raw long reads from wPipCelG (wPip-I) also identified distinct cif-Type I genomic islands. While the overall pattern agrees with the wPip-I (Tunis) report [56], the exact compositions are not identical: one additional configuration observed in wPipCelG (Palindrome 2) was not previously reported, whereas one of the previous configurations (Tunisp-d) is absent from our dataset, suggesting that variability in these regions is likely not dependent on wPip group.
Notably, the number of distinct cif configurations exceeds coverage-based copy number estimates (five Type I cifA-cifB tandem vs. four copies inferred from coverage; for Type IV cif, coverage supports one copy yet two configurations were observed). These modest mismatches between coverage-based copy number and the number of cif configurations could suggest the presence of distinct syntenic configurations coexisting among Wolbachia cells (i.e., subpopulations) within the same host. Overall, these observations point to structural heterogeneity both between mosquito strains within the same wPip group and within individual hosts, where distinct Wolbachia genomes bearing different prophage organizations may co-occur and/or lytic WO forms may contribute additional configurations. We note that some configurations are supported by few long reads, and although chimeric molecules cannot be excluded, such patterns may reflect rare states present at low frequency. Given the low sequencing depth of this sample, with configurations supported by at most eight reads (> 10kb), single occurrence observations remain noteworthy and potentially meaningful, warranting further investigation. The fast evolving long read sequencing technologies may allow much deeper sequencing of such low biomass samples in a near future, shedding light on these and other rare phage WO variants.
Based on these data, it is conceivable that a substantial portion of WO phage diversity and structure remains undetected by current assembly approaches, even when using high-quality long read data. In support of this idea, our qPCR results indicate in addition to a high Wolbachia load in the gravid ovaries of Celestin strain, with an estimated 40 bacteria per host cell, an average of approximately 9 copies of a targeted phage gene (present in 3 to 4 copies in the wPipCelG and wPipPel assemblies, respectively). These values are consistent with the coverage-based estimations afforded in this study and further the idea that phage diversity and/or abundance in this strain and possibly other Culex ones is strongly underestimated.
Conclusion
This study illustrates the importance of critically assessing assembly outputs, especially when working with highly repeated and mobile regions prone to structural complexity within an endosymbiont genome. When feasible, the combination of long read and short read data, along with direct inspection of raw reads and coverage profiles, offers a valuable strategy to distinguish biological genomic features from mutational or assembly artefacts. In our case, long read sequencing without prior DNA amplification enabled direct access to single-organism genomic information, while complementary short read sequencing from additional samples helped refine and validate the plasmid assembly. Thanks to this approach, we provide here a new genome for Wolbachia of Culex pipiens molestus, characterized both at the genomic level (with some WO-rich regions still partly unresolved) and through in situ visualization within the host tissue. In addition, our observations further illustrate the dynamic nature of the Wolbachia genome in Culex pipiens, particularly regarding its mobilome. The structural variability and signs of activity associated with both plasmid and phage elements support that these components may play an ongoing and significant role in the genomic plasticity of this symbiont. Notably, the pWCP element, long regarded as a putative plasmid, is here confirmed as a replicative plasmid, carrying a functional origin of replication and occurring in both monomeric and dimeric forms. These findings echo previous suggestions that, despite the streamlined nature typically associated with intracellular genomes, the acquisition and activity of MGE may serve as a source of diversification and adaptive potential for Wolbachia, helping to partially counteract the reductive pressures of its intracellular lifestyle [7, 57].
Methods
Quantification and visualization of Wolbachia prior to sequencing
We assessed the abundance and localization of wPip in ovaries from Culex pipiens molestus (Celestin strain) to ensure the tissue suitability for long read sequencing. This mosquito strain was collected in Montpellier (France) in 2023 and has since been maintained as a laboratory colony under standard insectary conditions [27]. These exploratory analyses were conducted prior to DNA extraction for Nanopore sequencing.
Gravid ovaries from Culex pipiens molestus Celestin individuals were dissected following previous methods [58], and total DNA was extracted using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany), following the manufacturer’s protocol for quantification. Quantitative PCR (qPCR) was performed to assess the abundance of Wolbachia and associated MGE, targeting four genes: wsp (Wolbachia surface protein), GP11 (plasmid pWCP), GPW (WO phage connector/ baseplate module), and ace2 (mosquito acetylcholinesterase gene, used as host reference). Among these, wsp, GP11, and ace2 are present as single-copy genes in their respective genomes, while GPW occurs in multiple copies (four in the wPipPel reference genome). All qPCR conditions (thermal cycling, reagents, instrumentation, and quantification based on synthetic plasmid standards) followed the protocol described previously [27]. The relative abundance of Wolbachia was calculated by normalizing wsp to the host gene ace2, as in the referenced study. To estimate the relative abundance of MGEs within the bacterium, GP11 and GPW copy numbers were normalized to wsp.
Fluorescent in situ hybridization (FISH) was performed on dissected ovaries in 1X phosphate-buffered saline (PBS), following the same dissection protocol. Samples were fixed in 4% paraformaldehyde (PFA) for one to three hours at 4°C, then washed two to three times with 1X PBS to remove residual PFA. Following washes, samples were stored at − 20°C in 1X PBS containing 50% ethanol until hybridization.
Hybridization was carried out overnight at 46°C in 180 μL of hybridization buffer (35% formamide, 0.9 M NaCl, 20 mM Tris-HCl [pH 7.6], 0.01% SDS) supplemented with 20 μL of each probe at 30 ng/μL. Two Wolbachia-specific 16S rRNA probes were used: W2: 5′-CTTCTGTGAGTACCGTCATTATC-3′ and W3: 5′-AACCGACCCTATCCCTTCGAATA-3′, both labeled with Atto 488 (Eurofins). After hybridization, samples were washed twice in 1 mL of pre-warmed washing buffer (20 mM Tris-HCl [pH 7.6], 70 mM NaCl, 5 mM EDTA, 0.01% SDS) for 30 min and 15 min respectively at 48°C, followed by an additional 15 min wash in 1X PBS at room temperature. Samples were stained with DAPI (1 μg/mL, Sigma-Aldrich), rinsed in 1X PBS, dried, and mounted using ProLong^™^ Gold Antifade Reagent (Invitrogen, ThermoFisher Scientific). Imaging was performed using a Leica Stellaris confocal microscope equipped with a 63x objective.
Long read sequencing
Gravid ovaries from a single Culex pipiens molestus (Celestin, strain) were dissected and genomic DNA was extracted using a phenol-chloroform protocol optimized to preserve high molecular weight fragments. DNA concentration was measured with a Qubit fluorometer (High sensitivity DNA kit, Thermo Fisher Scientific) and the fragment size distribution was assessed with a Bioanalyzer (Agilent Technologies). Extracted DNA was stored at −20°C until sequencing. Library preparation for Oxford Nanopore sequencing was carried out using the Ligation Sequencing Kit for genomic DNA (SQK-LSK110, Oxford Nanopore Technologies), following the manufacturer’s protocol. To avoid PCR amplification of the extracted ovarian DNA (which is commonly performed to reach the minimum DNA input requirement but may introduce amplification biases), we opted for a direct sequencing approach. To achieve the required minimum input of 1 μg of DNA, we supplemented the extracted sample with 1000 ng of lambda phage DNA (Supplementary Table 1), a well-characterized genome that served as an inert carrier and could later be filtered out from the sequencing data. A blast search between the lambda sequence (NCBI: J02459.1) and reference wPipPel genome (GCF_000073005.1) to look for potential similar similarity revealed no homology, ensuring that downstream analyses remained unaffected. The sequencing run was performed on a MinION Mk1C and lasted 24 hours.
Nanopore data processing and Wolbachia assembly
Raw Nanopore sequencing data was first base-called using Guppy v6.5.7 (Oxford Nanopore Technologies) with the Super Accuracy (SUP) model to convert FAST5 files into FASTQ format while simultaneously trimming residual sequencing adapters. A preliminary quality control step was then performed using NanoPlot [59] to assess the distribution of read quality scores and lengths. Reads mapping to the Culex pipiens molestus reference genome (NCBI: CM044968.1) and the lambda phage genome (NCBI: J02459.1) were identified using Minimap2 [60] and removed with Illumina-utils [61]. To prevent the potential removal of wPip sequences due to close identity with the host, we masked parts of the Culex pipiens molestus genome with over 0.85 identity to Wolbachia pipientis over 80 bp length, or with entropy lower than 0.7, using the BBtools [62] suite. This filtering step aimed to focus downstream bioinformatic analyses specifically on Wolbachia derived reads, thereby optimizing computational efficiency. The resulting filtered long reads were assembled using Flye [63].
Assembled contigs were then imported into anvi’o v8 [64] for genome annotation and interactive visualization. Functional annotations were performed using the KOfam (KEGG Orthology) [65] and COG (Clusters of Orthologous Groups) [66] databases. A taxonomic classification of predicted genes was carried out with Centrifuge [67], in order to verify the identity of contigs and guide genome binning.
We manually refined the resulting Wolbachia MAG that we hereafter named wPipCelG (for Wolbachia in Culex pipiens molestus, Celestin strain, gravid ovaries), using the interactive program “anvi-refine”, thereby generating a novel reference genome for the Culex pipiens complex (Celestin strain, gravid female). Lastly, we applied Proovframe [68] to correct fragmented open reading frames (ORFs) resulting from long read sequencing, with the goal of improving the overall quality of the genome annotation. These analyses are reported in Supplementary Notebook 1.
Pangenomic analyses
We conducted a pangenomic analysis using anvi’o v8 following the established workflow detailed in https://merenlab.org/2016/11/08/pangenomics-v2/. The analysis included three reference wPip genomes: wPipPel (GCF_000073005.1), wPipJHB (GCF_000156735.1), and wPipMol (GCF_000723225.2). The pangenome was constructed using the anvi-pan-genome command with the parameters --use-ncbi-blast to compute pairwise sequence similarity using BLASTp, with default sequence similarity threshold –min-bit 0.5, and with --mcl-inflation 10 to control the granularity of gene clustering. This high inflation value was chosen to ensure stringent clustering, so that only highly similar gene sequences were grouped into the same gene cluster (defined as groups of homologous genes based on amino acid sequence similarity). We also computed pairwise average nucleotide identity (ANI) and sequence coverage between wPipCelG and each reference genome using the pyANI toolkit [69] with the ANIm symmetric method [70] to complement the gene-based comparison.
The resulting pangenome comprised 1,166 gene clusters, among which 114 clusters from wPipCelG were annotated with functions matching previously characterized WO phage genes (WOPip1 to WOPip5, [12]). For each of these phage-associated clusters, all paralogous sequences from wPipCelG were extracted. The longest sequence among each set of paralogues was selected as the representative gene for downstream analysis.
We focused on genes belonging to key WO structural and replicative modules—head, connector/baseplate, tail, tail fiber and replication/repair —as well as on genes involved in reproductive manipulation and proliferation, such as cif and Octomom genes, respectively, to estimate the number of prophage gene copies [12]. Filtered long reads were mapped to the representative sequences of each gene cluster (defined as the longest gene sequence within the cluster from the pangenomic analysis) for structural and replicative modules while on unique sequences for cif and Octomom genes, using Minimap2 [60], and the coverage for each gene was calculated. Given the difficulty to assemble repeated regions and to convert coverage into estimated copy number, we performed a linear regression analysis using the read coverage of wPipCelG genes as predictors and their actual copy number within the genome as the response variable. The resulting model yielded an R^2^ of 0.707, a slope of 0.03, and an intercept of − 1.09. This regression equation was then used to estimate the copy number of each WO phage gene cluster based on its coverage. All data processing steps, model training, and coverage calculations are provided in Supplementary Notebook 2.
Refinement and annotation of contigs
The initial wPipCelG genome assembly performed using Flye [63] produced an initial set of 24 contigs after which we conducted a manual scaffolding step. Raw reads were aligned back to the assembled contigs using Minimap2 [60] with default parameters for Oxford Nanopore reads. Read alignments were then filtered with Samtools v1.10 [71], keeping only reads with a mapping quality above 2 (MAPQ ≥ 2), in order to exclude low confidence and secondary alignments. We assessed the effect of this procedure in Supplementary Notebook 3. For each contig, summary statistics including depth of coverage, number of mapped reads, read length distribution, and N50 were calculated (see Supplementary Table 2). Contigs with no or negligible coverage were flagged as potential artefacts. To further assess redundancy, a pairwise BLASTn analysis was performed between all contigs. Contigs showing 100% identity to internal regions of larger contigs were considered redundant or misassembled and were manually excluded from the final MAG.
After manual refinement, the remaining contigs were individually annotated to better characterize phage-related elements. Functional annotation was performed using a database specifically curated for WO prophage genes [24]. In parallel, the origin of replication of the wPipCelG genome was predicted using ORI-Finder [44].
Phylogenetic analyses
All phylogenetic analyses were performed with Maximum Likelihood (ML) method using RAxML-NG v1.1.0 [72], and 1,000 bootstrap replicates to assess node support. The best fit amino acid substitution models were identified using ModelTest-NG v0.1.7 [73], based on the Akaike Information Criterion corrected (AICc). A ML tree was generated using concatenated nucleic sequence of six marker genes, pk1, ank2, pk2, mutL, gp12, and gp15 to determine the wPip group of wPipCelG [34]. The srWO groups of the WO prophages was assessed by constructing a ML tree based on the amino acid sequences of the large serine recombinases as defined by Bordenstein & Bordenstein (2022) [12].
Analyses of plasmid contigs and reads
Contig_9 (annotated as a dimeric version of pWCP) was aligned against the full length pWCP reference sequence with Jalview [74], allowing for the identification of nucleotide level differences. As highlighted by Trigodet and colleagues, long read assemblies, particularly of circular or repeated elements, may introduce structural artefacts that can affect biological interpretation [75]. We therefore examined the corresponding raw long reads to confirm the structural accuracy of the assembled plasmid sequences. Long reads mapping to contig_9 were extracted using Samtools v1.10, retaining only those with a mapping quality (MAPQ) score ≥ 2. Then, we selected only reads longer than 9,000 bp with an in-house awk script after which they were aligned to the reference pWCP sequence and to the dimeric contig (see Supplementary Notebook 4).
Since the ligation sequencing kit for gDNA involves the addition of adapters to both ends of DNA fragments for sequencing, we hypothesized there is a risk that adapters may remain ligated between two reads after sequencing, leading to the formation of chimeric sequences. To verify that this was not the case, we checked for the absence of adapters in the extracted reads with a BLAST analysis and inspection in Integrative Genomics Viewer (IGV v2.17.3) (see Supplementary Notebook 5).
Identification of the origin of replication and iterons in the pWCP reference sequence
We conducted an independent analysis of the pWCP reference sequence [9] to identify replication features and investigate the potential mechanism underlying the multimeric form of the plasmid observed in our data. The putative origin of replication (oriV) was predicted using OriV-Finder for plasmids [43], a dedicated web server that integrates conserved replication initiation protein domains, iteron-like repeat motifs, and AT-rich regions to detect candidate oriV sequences.
Long read based analysis of WO phage regions
To probe synteny and potential rearrangements in WO prophage regions directly from the sequencing signal, we analyzed the raw Nanopore reads directly. From the genome-wide long read alignment to the wPipCelG assembly, we selected reads longer than 10,000 bp that did not map to contig_2 (the Wolbachia core-genome contig) and contig_9 (the dimeric plasmid), yielding 254 long reads. Each read was annotated by nucleotide alignment against the wPipPel reference genome (accession number: AM999887.1) using BLASTn. We retained for each read up to five, non-overlapping high-scoring segment pairs, ranked by bit score and percent identity (> 80%), and discarded lower-scoring overlaps to accommodate possible intra read structural rearrangements relative to the reference.
Then, we derived per read gene–gene adjacency counts from the annotated long reads (≥ 10 kb) to capture alternative gene orders underlying potential mis-assembly or assembly difficulty. For each read, WO gene hits were ordered by their coordinates on the read, and every consecutive pair was tallied as an adjacency event. Adjacency counts were then aggregated across reads to obtain edge weights and used to build a compact synteny graph centered on the Type I cif genes.
Short read sequencing of Culex pipiens molestus ovaries
Due to the limited amount of DNA that can be extracted from a single pair of ovaries, it was not technically feasible to perform hybrid sequencing (i.e., both long and short read sequencing) on the same individual without prior DNA amplification. We collected and sequenced additional ovary samples from the same mosquito strain using Illumina short reads to overcome this limitation and allow for subsequent error correction of the long read data. Genomic DNA was extracted from three independent pairs of ovaries (same mosquito strain and dissection protocol) using the DNeasy Blood & Tissue Kit (Qiagen). Library preparation and Illumina NovaSeq 6000 sequencing (2×150 nt, S4 flow cell) were performed by the MGX platform (Montpellier GenomiX, France) using the Illumina DNA Prep Kit. Tagmentation, indexing, quantification, and quality controls were carried out according to the manufacturer’s default protocol. Sequencing data were demultiplexed with bcl2fastq (v2.20.0.422), and quality assessed with FastQC (v0.11.9) and FastQ Screen.
Short read-based verification of candidate mutations
The raw short reads from the three Celestin strain samples were processed using the AliNE workflow (version 1.1.0; [76]), which includes trimming with fastp, quality control with FastQC, and mapping with Bowtie2 to the newly assembled wPipCelG genome. Alignments were used to re-examine the mutations initially identified in the dimeric plasmid relative to the reference sequence and determine whether Nanopore-specific mutations were also detected in Illumina reads.
Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Weinert LA, Araujo-Jnr EV, Ahmed MZ, Welch JJ. The incidence of bacterial endosymbionts in terrestrial arthropods. Royal Soc. 2015;282:20150249.
- 2Taylor MJ, Bandi C, Hoerauf A. Wolbachia. Bacterial endosymbionts of filarial nematodes. In: Baker JR, Muller R, Rollinson D, editors. Advances in Parasitology. Academic; 2005. pp. 245–84.
- 3Werren JH, Baldo L, Clark ME. Wolbachia: master manipulators of invertebrate biology. Nat Rev Microbiol. 2008;6:741–51.18794912 10.1038/nrmicro 1969 · doi ↗ · pubmed ↗
- 4Le Page D, Bordenstein SR. Wolbachia: Can we save lives with a great pandemic? Trends in Parasitology. 2013;29:385–93.23845310 10.1016/j.pt.2013.06.003PMC 3775348 · doi ↗ · pubmed ↗
- 5Shropshire JD, Leigh B, Bordenstein SR. Symbiont-mediated cytoplasmic incompatibility: What have we learned in 50 years? Weigel D, editor. e Life. 2020;9:e 61989.
- 6Minwuyelet A, Petronio GP, Yewhalaw D, Sciarretta A, Magnifico I, Nicolosi D Symbiotic Wolbachia in mosquitoes and its role in reducing the transmission of mosquito-borne diseases: updates and prospects. Front Microbiol. 2023;14.
- 7Kent BN, Salichos L, Gibbons JG, Rokas A, Newton ILG, Clark ME, Complete Bacteriophage Transfer in a Bacterial Endosymbiont (Wolbachia) Determined by Targeted Genome Capture. Genome Biol Evol. 2011;3:209–18.21292630 10.1093/gbe/evr 007PMC 3068000 · doi ↗ · pubmed ↗
- 8Vancaester E, Blaxter M. Phylogenomic analysis of Wolbachia genomes from the Darwin Tree of Life biodiversity genomics project. P Lo S Biol. 2023;21:e 3001972.36689552 10.1371/journal.pbio.3001972 PMC 9894559 · doi ↗ · pubmed ↗
