Investigating within-host population diversity of Cryptosporidium parvum using BlooMine
Arthur V. Morris, Thomas Connor, Justin Pachebat, Martin Swain

TL;DR
This study uses a new method to analyze genetic diversity in Cryptosporidium parvum, revealing how infections within a host can lead to subtype variation and evolution.
Contribution
The study introduces a novel alignment-free tool, BlooMine, to accurately detect genetic diversity in C. parvum from NGS data.
Findings
Cattle samples showed 2.3-fold higher odds of multiple Gp60 subtypes compared to human samples.
Subtypes previously thought host-specific were found in unexpected hosts, suggesting greater host plasticity.
Allele co-occurrence patterns indicate both replication slippage and within-host competition.
Abstract
Investigating multiplicity-of-infection (MOI) in pathogen populations is central to understanding within-host evolutionary dynamics. In Cryptosporidium parvum, MOI may play a pivotal role in diversification due to the parasite’s sexually recombinogenic life cycle, which occurs entirely within a single host. Subtyping of C. parvum typically relies on variation at short tandem repeat (STR) loci - such as that found in the 60kDa surface glycoprotein gene, Gp60 - but accurate profiling of these regions from next-generation sequencing (NGS) data remains technically challenging. Here, we apply BlooMine, an alignment-free, Bloom filter based tool that isolates STR containing reads via a novel pseudo-alignment strategy robust to structural and sequence variation. Using this approach, we analyse the full publicly available C. parvum Illumina dataset to quantify polyclonality at the Gp60 locus.…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7- —Knowledge Economy Skills Scholarship - KESS II
- —Wellcome Trust
- —https://doi.org/10.13039/100012068Health and Care Research Wales
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParasitic Infections and Diagnostics · Toxoplasma gondii Research Studies · Trypanosoma species research and implications
Introduction
Cryptosporidium is a single celled eukaryotic parasite which causes substantial mortality and morbidity worldwide, with it being implicated in over 200,000 deaths annually [1]. Extracting sufficient DNA from clinical samples of Cryptosporidium for the purpose of whole genome analysis is a challenge due to the often low faecal mass provided in clinical samples, and low oocyst count of these samples [2, 3]. Despite recent advances being made both experimentally and computationally, there is currently no workflow which allows for routine generation and analysis of whole genome data from clinical Cryptosporidium samples [4–6].
Multiplicity of infection (MOI) is a well documented observation within the natural world, and has been reported widely in parasitic protozoan infection. In 1999, Lord et al. reported that the majority of adults infected with Plasmodium falciparum were host to more than five distinct strains [7]. MOI may arise by two principal processes: infection by multiple strains; and in-host divergence into genetically distinct sub-populations. The presence of multiple clonal populations of a parasite in a single host (referred to as ‘polyclonality’) has a significant impact on their incidence, transmission, and virulence, factors of enormous importance to public health. An incomplete understanding of parasite polyclonality leads to potentially inaccurate assumptions about the epidemiology, clinical presentation and outlook of the infection. The virulence experienced by a host which exhibits polyclonality is a product of the interactions between the different clonal populations, resulting in an overall virulence of anywhere from greater than the most virulent to less than the least virulent, depending on various biological factors [8, 9]. Efforts have been made to identify general patterns when ecological and evolutionary theory are applied to within-host dynamics, leading to the generalisation that ‘basic ecological rules govern the outcome of co-infection across a broad spectrum of parasite taxa’ [10]. The investigation into the impact of MOI on a host relies on the accurate and reliable detection and discrimination of the clonal component populations of the mixture. With modern methods of DNA isolation and purification, and the fall in price of using high-throughput Next-Generation sequencing (NGS) technologies to perform whole genome sequencing (WGS), investigating MOI by mining WGS reads has become a more feasible approach [2, 3]. Furthermore, it allows for unrestricted investigation of local heterogeneity at any position within the genome, something which is simply not possible to carry out using conventional molecular wet-lab techniques. It is both biologically plausible (due to unrestricted sexual recombination between sub-populations), and there is strong evidence, that mixed-infections can give rise to novel strains of Cryptosporidium spp., which may be genotypically and phenotypically distinct from their parents, representing a serious challenge to public health control services which rely on epidemiological surveillance [11–14]. It is therefore essential that polyclonality can be characterised in order to improve epidemiological surveillance, which is essential in outbreak investigation and subsequent interruption of transmission - a critical component in the fight against cryptosporidiosis and other parasitic diseases.
Alignment-free sequence analysis has the capacity to resolve the significant bottleneck between data generation and analysis which has been well documented in ‘omics research, as it may be both faster and more computationally efficient than sequence alignment based approaches [15]. This allows for the development of tools which can mine information from raw sequencing reads, obviating the significantly time-consuming and computationally expensive task of assembly. There are a great many tools which employ alignment-free methods to identify nucleotide and protein sequence similarity, developed to resolve problems in many disciplines within Bioinformatics. However, the majority of the tools developed to resolve these from pathogen clinical data utilise single nucleotide polymorphism (SNP) data, for which variant callers are highly optimised. Variable Number Tandem Repeat (VNTR) loci have been regions subject to heavy interrogation, and have been used to define subtypes and variants in many pathogen species [16]. Despite great advances in the use of NGS data for assembly-free genomic analysis, low concordance of variant calling pipelines has been reported, and such pipelines are often incapable of resolving VNTR repeat subunit copy number even across small VNTRs which can be captured in a single read, which remains a troubling issue [17].
Here we present BlooMine: a novel raw read mining tool developed to facilitate quick and computationally efficient local analysis of sequences captured within raw reads generated by whole or partial genome sequencing projects. It utilises Bloom filters, a highly space efficient probabilistic data structure, and a novel pseudo-alignment algorithm to perform set membership queries and infer sequence homology [18]. We use BlooMine to investigate MOI in clinical samples of Cryptosporidium parvum, demonstrating that it can be used to perform both in silico subtyping from NGS reads, and local genomic heterogeneity analysis, facilitating both informed epidemiological investigations in public health, and robust hypothesis testing critical to the development of novel approaches of transmission interruption.
Materials and methods
Implementation
BlooMine is written using C++ and Python 3. It consists of two modules:
- BlooMine_screen: Screen a FASTQ dataset for reads which contain a target sequence.
- BlooMine_demix: Analyse reads for target sequence heterogeneity. The BlooMine_screen module consists of 3 steps:
- BlooMine_gen: Generate a Bloom filter from the target sequence.
- BlooMine_FPscreen: Carry out a first pass screen of reads using the target Bloom filter.
- BlooMine_SPaln: Carry out a second pass target sequence pseudo-alignment step.
BlooMine_gen
A target sequence, T is used to generate a kmer array, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K_T^k$$\end{document} . This is then passed through hash functions to generate a bit signature for T, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$B_t$$\end{document} . The number of hash functions can be adjusted to meet user type-I and type-II error rate requirements.
BlooMine_FPscreen
Reads within the FASTQ file are screened against the Bloom filter, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$B_t$$\end{document} , generated during the BlooMine_gen process to detect the presence of the target sequence. This step constitutes the first pass screening. A kmer array is generated from each read within the FASTQ, which is then hashed using the Bloom filter hash functions. The subsequent bit signature generated by this hashing process is screened against \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$B_t$$\end{document} to infer set membership. The detection of a hit increments the score by 1. If the overall score for the read exceeds the kmer intersection threshold, sequence similarity is inferred and the read returned for further processing. Trivially (without a description of Bloom filter functionality) the set of all reads which exceed the kmer intersection threshold is therefore
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} R' = \{ r \in R\ \big \vert \ |K_T^k \cap K_r^k| \ge \theta \} \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K_r^k$$\end{document} is a kmer array of the read sequence, T is the target sequence, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta$$\end{document} is kmer intersection threshold.
BlooMine_SPaln
BlooMine_SPaln filters reads using a novel soft-alignment algorithm (Algorithm 1) based on positional k-mer mapping. For each read in the filtered set \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R'$$\end{document} , the positions of all matching kmers are identified and used to construct a sparse positional vector aligned to the read.Algorithm 1 BlooMine SPaln Second-pass screen: Alignment-free alignment inference
From this vector, gapped alignment regions are inferred by grouping together adjacent kmer hits that are separated by no more than a threshold number of positions. These are defined as subalignment chunks. Each chunk is scored using a gap-sensitive function to determine the optimal alignment region.
The gap threshold \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\omega$$\end{document} is defined as:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \omega = \left\lceil \frac{hk - g}{n} \right\rceil \end{aligned}$$\end{document}This value is chosen such that a single matching kmer followed by a gap of size \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\omega$$\end{document} results in a total alignment score near zero, thereby discouraging long, sparse alignments.
Each chunk c is then scored using the following scoring algorithm. Let chunk \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c=\{c_0,c_1,...,c_{|c|-1}\}$$\end{document} be a sequence of aligned positions (indexed by position in the read), we define the score of the chunk \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_c$$\end{document} as:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} s_c = \sum \limits _{i=0}^{|c|-1}\delta (c_i,c_{i-1}) \end{aligned}$$\end{document}Where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta (c_i,c_{i-1})$$\end{document} is defined as:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \delta (c_i,c_{i-1}) = \left\{ \begin{array}{ll} +h & \text {if } c_i \in K_T \\ -g & \text {if } c_i \notin K_T \wedge c_{i-1} \in K_T \\ -n & \text {if } c_i \notin K_T \wedge c_{i-1} \notin K_T \end{array}\right. \end{aligned}$$\end{document}Where:
k = the size of each kmer (default=7).
h = the value the score is incremented by when a hit is triggered (10.0).
g = the value the score is penalised by when a gap is opened (15).
n = the value the score is penalised by when a gap is extended (7.0).
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c_i$$\end{document} = position i within the aligned chunk c.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K_T$$\end{document} = the kmer array of target sequence T.
Alignment regions are split at any gap of size \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ge \omega$$\end{document} . All contiguous sub-sequences of alignment chunks (i.e. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(n(n+1)/2)$$\end{document} combinations for n chunks) are evaluated using the scoring function. The highest-scoring alignment is selected. If this score exceeds the minimum score threshold MST (see Minimum score threshold section), the read is deemed to contain the target sequence. Figure 1 presents a graphical representation of this process using a full target region consisting of two flanking regions and a polymorphic STR locus (see Fig. 2 for an example model of an STR target region).Fig. 1. Workflow for identifying target regions in short reads
Fig. 2A model of a target sequence, showing the target region consisting of a TCA repeat region, and flanking regions. Where... refers to an arbitrary number of the preceding repeat motif
Minimum score threshold
Due to the polymorphic nature of certain genomic regions (e.g. VNTRs), traditional alignment algorithms such as Smith-Waterman and Needleman–Wunsch often fail to produce reliable results due to their assumption of globally consistent match/mismatch patterns [19, 20]. These methods are not robust to high densities of SNPs, insertion/deletion events (INDELs), or rearrangements.
To address this, we propose an approximate alignment scoring scheme based on k-mer matches across sliding windows. The key insight is that mismatches often cluster, particularly in low-complexity and repetitive regions, rather than present as uniformly distributed across a sequence.
Let s be a quest sequence of length |s|, and let \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa$$\end{document} be the expected mismatch spacing (i.e. the average number of bases between mismatches). The mismatch rate is then defined as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\epsilon =\frac{1}{\kappa }$$\end{document} , and the expected number of mismatches in s is \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|s| \cdot \epsilon$$\end{document} .
We define a sliding window of size \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k+\kappa -1$$\end{document} . The expected number of mismatches per window, e is therefore
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} e=\frac{k+\kappa -1}{\kappa } \end{aligned}$$\end{document}For example, with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=5$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa =10$$\end{document} , a window spans 14 bases and is expected to contain 1.4 mismatches.
Let \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w=\frac{|K_t|}{k+\kappa -1}$$\end{document} be the number of such windows over the target sequence. The total expected mismatch count, M, is
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} M=e(w-1) \end{aligned}$$\end{document}The minimum alignment score is then calculated as
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} MST=h\cdot |K_t|-wg+nM \end{aligned}$$\end{document}This novel alignment algorithm uses concepts from both alignment and alignment-free sequence analysis approaches to resolve alignments from sequences which have undergone rearrangement or INDEL events, since regions flanking highly polymorphic VNTR regions may be poorly conserved. Furthermore, alignments are carried out on small sequences (single reads) potentially many thousands of times, which may render conventional hard-alignment approaches less efficient.
BlooMine_MOI
BlooMine functionality can be leveraged to investigate the presence of multiple alleles within a single FASTQ file, which may indicate MOI in clinical pathogen samples. Reads are filtered in two steps using probe sequences flanking the target region as proxy-targets. First, reads containing flank 1 are identified within the full dataset. Second, the resulting subset is further filtered for reads containing flank 2. The output is the set of reads which contain both flanks, and therefore the target region. These reads are then used to calculate the level of heterogeneity in the target region.
Probe sets for three major C. parvum Gp60 families (IIa, IId, IIc) were developed using sequences flanking the STR region. These probes were determined to cover a large portion of the C. parvum subtype space. To account for probe cross-matching between Gp60 families, the major Gp60 family was heuristically determined using probe set which yielded the largest number of read hits. Instances of multiple Gp60 family hits were handled using bespoke scripts which determined the best matching probe pair for each read.
MOI investigation of C. parvum
A dataset of all publicly available Cryptosporidium parvum (n=667) NGS samples was utilised in this study. Samples which were a product of artificial infection, or lab strains, were filtered out (n=38). The remaining samples (n=629) were screened using the highly polymorphic Gp60 microsatellite locus as a target. These samples were taken from Europe, North America, Africa, and Asia, and from 7 hosts: Cattle (Bos taurus), Human (Homo sapiens), Sheep (Ovis aries), Goat (Capra hircus), Water Buffalo (Bubalus bubalis), Bactrian camel (Camelus bactrianus) and Mouse (Mus musculus). The BlooMine MOI pipeline was run on each of these samples using sequences taken from conserved regions flanking the Gp60 locus.
BlooMine was executed using a false-positive rate of 0.0001, a FP-screen threshold of 35.0 and a SP-screen threshold ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa$$\end{document} ) of 4.0. All data were analysed on a laptop running Ubuntu v22.04.5, with 64Gb memory and using 12 threads.
Gp60 subtyping of Cryptosporidium
Subtyping of Cryptosporidium involves interrogation of a highly polymorphic VNTR (serine repeat) region within the Gp60 surface glycoprotein gene on chromosome 6 [21]. Due to its highly polymorphic nature in C. parvum, it has been used to perform epidemiological investigations in this species for many decades. Subtypes are elucidated by determining the fragment length and sequence content of the Gp60 allele (a variant of a target locus).
Here, allele naming follows a convention analogous to that used when reporting the Gp60 subtype of Cryptosporidium, which was thoroughly described by Robinson * et al* [22, 23]. A, G, and R are followed by integers representing the numbers of TCA, TCG and terminal ACATCA counts respectively. This yields an allele code like ’A15G2R1’. Gp60 family is determined by the sequence content of the Gp60 gene, and takes the form of a latinised alphanumeric code (e.g. IIa) which indicates both the species and Gp60 family of the sample, whereby the two major species, C. hominis and C. parvum are designated I and II respectively. This yields a full Gp60 subtype code which indicates the family and the allele present in any given sample (e.g. IIaA15G2R1). However, since the full Gp60 region was not utilised in determining the Gp60 family in this study (due to limitations of read length), we present the probe which exhibited the highest fidelity with each allele after the allele name. For example A15G2R1-IIa would refer to the A15G2R1 allele extracted using probes designed against sequences from Gp60 family IIa. This is done to distinguish it from the conventional nomenclature and ensure that the Gp60 family designations are not taken as definitive.
Allele filtering and PCR stutter artefact correction
To accurately characterise the allelic composition of STR loci from sequencing data, particularly the Gp60 locus, we employ a two-stage filtering process designed to exclude sequencing noise and PCR stutter artefacts.
Allele Filtering
If we consider the set of all reads which have been determined to contain a target region by BlooMine as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R''$$\end{document} , then we presume that all variants of the target sequence (’alleles’) are embedded within this read set. An allele count vector is computed for each allele within \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R''$$\end{document} as
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \mathcal {C}_{R''} = \left[ c_1, c_2, \dots , c_n \right] \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c_i$$\end{document} is the number of reads in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R''$$\end{document} assigned to allele \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_i$$\end{document} (the allele count of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_i$$\end{document} ), and n is the total number of distinct alleles identified. From this, we define a normalised frequency vector:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \mathcal {F}_{R''} = \left[ f_1, f_2, \dots , f_n \right] = \left[ \frac{c_1}{\sum _{j=1}^n c_j},\; \frac{c_2}{\sum _{j=1}^n c_j},\; \dots ,\; \frac{c_n}{\sum _{j=1}^n c_j} \right] \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f_i$$\end{document} is the relative frequency of allele \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_i$$\end{document} across all reads in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R''$$\end{document} (the allele frequency of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_i$$\end{document} ). The major allele is defined as the allele with the highest frequency in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {F}_{R''}$$\end{document} , and minor alleles are all other alleles.
For example, suppose we have 40 reads which fully cover the Gp60 region, including its flanking regions. This ensures that the STR locus is not clipped and allows for confident allele calling. The total allele count at this locus is therefore \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 40$$\end{document} , with observed counts:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \mathcal {C} = \left[ \begin{array}{l} a_1 : 20 \\ a_2 : 10 \\ a_3 : 6 \\ a_4 : 4 \end{array} \right] \end{aligned}$$\end{document}The corresponding frequency vector is:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \mathcal {F} = \left[ \begin{array}{l} a_1 : 0.5 \\ a_2 : 0.25 \\ a_3 : 0.15 \\ a_4 : 0.10 \end{array} \right] \end{aligned}$$\end{document}The major allele is the most frequent, in this case \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_1$$\end{document} , and all others are considered minor alleles. We then apply a naive hard-filter using a minimum count threshold \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T_c$$\end{document} and frequency threshold \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T_f$$\end{document} . Alleles that fall below either threshold are removed. If we assign the naive hard-filtering parameters in this examples as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T_c=3$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T_f=0.15, a_4$$\end{document} is filtered out due to its low frequency even though its count exceeds the minimum. This dual-threshold approach is functions to remove noise in both low and high depth scenarios: in high-depth datasets (e.g., \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 1000$$\end{document} ), noise may exhibit high counts but low frequency; in low-depth datasets, even a single erroneous read can represent a substantial proportion.
PCR Stutter Filtering
To further reduce artefacts arising from PCR amplification, we apply an additional stutter filter based on known dynamics of replication slippage in STR regions [24]. The most common stutter artefact is a loss of a single repeat unit (back-stutter; L1), followed less frequently by a loss of two repeat units (L2). Forward stutter and larger contractions are rare [25]. These stutter products are not sequencing errors, but true sequences generated from mis-amplified templates. We account for this by comparing the frequencies of minor alleles to that of the presumed template (major allele). For alleles \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_j$$\end{document} which are one or two repeat units shorter (thus potential L1 or L2 artefacts) than the major allele, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_M$$\end{document} , we compute a stutter ratio:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} s_j = \frac{f_j}{f_M} \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f_j$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f_M$$\end{document} are the frequency of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_j$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_M$$\end{document} respectively. We then assign a stutter ratio threshold for L1 artefacts ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R_1$$\end{document} ), and an L2 stutter ratio threshold of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R_2=R_1/2$$\end{document} . L1 and L2 candidates are considered artefacts if \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_j<R_1$$\end{document} or \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_j<R_2$$\end{document} respectively, and thus filtered out [26, 27]. For all other alleles (not plausible stutter products), only the naive filtering thresholds apply.
Continuing the example, suppose the filtered allele set is:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \left[ \begin{array}{l} a_1 : 20 \\ a_2 : 10 \\ a_3 : 6 \end{array}\right] \end{aligned}$$\end{document}with frequencies:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \left[ \begin{array}{l} a_1 : 0.5 \\ a_2 : 0.25 \\ a_3 : 0.15 \end{array}\right] \end{aligned}$$\end{document}Assume the following STR sequences:
- \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_1$$\end{document} : TCATCATCATCATCGTCG
- \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_2$$\end{document} : TCATCATCATCA (2 repeat unit loss)
- \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_3$$\end{document} : TCATCATCATCATCG (1 repeat unit loss)
Allele \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_2$$\end{document} is a candidate L2 stutter, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_3$$\end{document} is a candidate L1 stutter. Normalising the frequencies using Eq. 9 gives us:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} s_{a_2} = \frac{0.25}{0.5} = 0.5, \quad s_{a_3} = \frac{0.15}{0.5} = 0.3 \end{aligned}$$\end{document}If we assign aggressive stutter threshold parameters of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R_1 = 0.34$$\end{document} (thus \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R_2 = 0.16$$\end{document} ), \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_3$$\end{document} is filtered ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.3 < 0.34$$\end{document} ), but \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_2$$\end{document} is retained ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.5> 0.16$$\end{document} ), yielding the final filtered set:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \mathcal {C'} = \left[ \begin{array}{l} a_1 : 20 \\ a_2 : 10 \end{array}\right] , \quad \mathcal {F'} = \left[ \begin{array}{l} a_1 : 0.667 \\ a_2 : 0.333 \end{array}\right] \end{aligned}$$\end{document}Polyclonality rate in cattle and humans
Allele calls from the Gp60 locus were filtered to suppress PCR-stutter and low-support artefacts using a pre-specified parameter set: \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L1 = 0.15, L2 = 0.075$$\end{document} (i.e., \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L2 = L1/2$$\end{document} ), minimum allele count \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$= 3$$\end{document} , and minimum allele frequency \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$= 0.05$$\end{document} . Samples from hosts other than cattle (Bos taurus) or human (Homo sapiens) were excluded. A sample was classified polyclonal if \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ge 2$$\end{document} alleles passed all filters; otherwise monoclonal.Table 1. The host and continent stratified dataset used to conduct the CMH test. Values are given as: monoclonal count (polyclonal count)HumanCattleContinent totalEurope62 (3)77 (11)139 (14)North America163 (13)44 (8)207 (21)Total225 (16)121 (19)346 (35)
To control for geographic confounding and mitigate Simpson’s paradox, analysis was stratified by continent. Continents with fewer than two samples in either host were excluded a priori. The final analysis set comprised \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=381$$\end{document} cattle and human samples from Europe ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=153$$\end{document} ) and North America ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=228$$\end{document} ), of which 346 were monoclonal and 35 were polyclonal (see Table 1).
Within each retained continent, a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$2\times 2$$\end{document} table (host \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} poly/monoclonal sample count) was constructed and per-continent odds ratios (ORs) for polyclonality (cattle vs. human) computed. This stratification controls for uneven geographic distribution, and mitigates Simpson’s paradox. Mantel-Haenszel pooled OR ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$OR_{MH}$$\end{document} ) with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$95\%$$\end{document} confidence intervals were obtained across continents, and the null hypothesis \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_0:OR_{MH}=1$$\end{document} tested using the Cochran-Mantel-Haenszel (CMH) test (two-sided, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha =0.05$$\end{document} ).
Consistency of stratum specific ORs across continents was evaluated with the Breslow-Day (BD) test for homogeneity of stratum-specific ORs, and results interpreted under the criterion that BD \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p<0.10$$\end{document} indicates heterogeneity. Pooled results were interpreted with caution when heterogeneity was detected.
To assess the robustness of hypothesis testing under different filtering parameters, an exploratory threshold sweep over minimum allele count \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$= 1-30$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L1 = 0.01-0.30$$\end{document} (with L2 fixed at \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{L1}{2}$$\end{document} , and minimum allele frequency fixed at 0.05) was carried out. For each grid point, the continent-stratified CMH analysis was repeated and summarised by constructing a heatmap from \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$OR_{MH}$$\end{document} with CMH p-values overlaid. This sweep is descriptive, with all inferences relying on pre-specified thresholds of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L1=0.15$$\end{document} and minimum allele count = 3, and as such no multiplicity adjustments were carried out.
To probe additional predictors, a multivariable logistic regression was fit with polyclonality as the outcome and covariates: host, continent, sequencing depth (summed Gp60 allele counts across each sample), major-allele length, and Gp60 family (IIa vs IId, IIc excluded due to zero polyclonal samples).
Allele co-occurrence
To investigate allele co-occurrence and mutual exclusivity within polyclonal datasets, samples were stratified by continent of origin, and allele pairs with geographical contiguity—defined as co-presence within at least one continent—were tested using the Cochran–Mantel–Haenszel (CMH) procedure. This approach controls for confounding due to uneven geographic distribution, yielding pooled odds ratios and p-values across strata, with inclusion restricted to allele pairs exhibiting geographic contiguity (i.e. allele pairs occurring within at least one shared continent) and meeting a minimum per-continent occurrence threshold of three samples per allele. The Breslow-Day test was used to detect heterogeneity in odds-ratio across different continents. Fisher’s exact test was not applied due to its sensitivity to highly abundant alleles, which can lead to inflated estimates of both exclusivity and co-occurrence signal strength in globally unstratified datasets.
Results
Dataset summary
Of the 629 C. parvum genomic NGS datasets used, 592 fully covered the Gp60 locus with at least 3 reads and a minimum allele frequency of 0.05. Under the first pass filter only (without stutter filtering), 398 were monoclonal and 194 were polyclonal, exhibiting at least 2 alleles at the Gp60 locus.Fig. 3. Major/minor allele presence across all 51 samples which were resolved as polyclonal after applying a 0.15 stutter filter. Blue = major allele; Red = minor allele
Fig. 4. The number of samples which exhibit each allele as major or minor allele
Polyclonality prevalence
An L1 stutter filter of 0.15 was used to investigate allele co-occurrence and host/geographical distributions. Using this stutter filter, 51/592 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$8.6\%$$\end{document} ) polyclonal samples were resolved: 18 human, 31 cattle, and one each from sheep and mouse (see Fig. 3). All polyclonal samples were from North America ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=23$$\end{document} ), Europe ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=15$$\end{document} ) and Asia ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=13$$\end{document} ).
In total, 54 unique alleles were detected across clonal and polyclonal samples (Fig. 4); eight occurred only as minor alleles (A14G1R1-IIa, A17G2R1-IIa, A20G1R1-IIa, A22G1-IId, A23G1-IId, A23G1R1-IIa, A25G1-IId, A26G1-IId).
A15G2R1-IIa was the most common allele in the entire dataset ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=183$$\end{document} ) both as a major allele ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=158$$\end{document} ) and minor allele ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=11$$\end{document} ). It was the most frequent major allele in human ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=109$$\end{document} ), sheep ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=5$$\end{document} ) and goat ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=5$$\end{document} ) samples, and the second most common allele in cattle ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=64$$\end{document} ). A15G2R1-IIa comprised \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$28\%$$\end{document} of all alleles detected across the dataset and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$44\%$$\end{document} of human samples. The most common major allele in cattle was A19G1-IId ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=141$$\end{document} ) was found in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$45\%$$\end{document} of cattle samples. This allele was also found in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$80\%$$\end{document} ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=143$$\end{document} ) of all Asian samples, which were dominated by Chinese cattle samples, and a single North American sample.
Multiple host-specific alleles were found in humans ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=18$$\end{document} ), cattle ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=10$$\end{document} ), sheep ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=2$$\end{document} ), and mouse ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=1$$\end{document} ). Six of these host-specific alleles were only detected in polyclonal human (A10G2-IIc, A12G1-IIc*, A14G2R1-IIa*, A16G3R2-IIa, A17G2R2-IIa, and A18G3R2-IIa*) and seven in cattle (A14G1R1-IIa, A21G1R1-IIa, A22G1R1-IIa*, A23G1R1-IIa, A23G1-IId*, A25G1-IId*, A27G1-IId*) samples, of which six presented only as minor alleles (denoted by *).
Four samples showed cross-family polyclonality. Three were North American human infections with concurrent IIa and IIc alleles (including one IIa/IIa/IIc triple mixture). The fourth case was a mouse sample in which an atypical allele was recovered using the IId probe set. Subsequent alignment of this cryptic allele against a curated database of Gp60 sequences identified its closest match as Cryptosporidium tyzzeri, subtype family IXa.
Host differences in polyclonality
After filtering, within-continent ORs (cattle vs. humans) were 3.23 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$95\%$$\end{document} CI \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.87-12.02$$\end{document} ) in Europe and 1.8 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.72-4.49$$\end{document} ) in North America, corresponding to polyclonality rates of 0.116 vs. 0.039 and 0.140 vs. 0.083, respectively. The Mantel–Haenszel pooled estimate across continents was \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$OR_{MH} = 2.26$$\end{document} ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$95\%$$\end{document} CI \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.08-4.71$$\end{document} ; CMH \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p = 0.048$$\end{document} ), with no evidence of heterogeneity (BD \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p = 0.467$$\end{document} ). Human IIa/IIc co-infections were observed only in North America and therefore account for the lower continent-specific OR observed there. Cattle had \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 2.3$$\end{document} -fold higher odds of harbouring a polyclonal C. parvum infection than humans after conditioning on continent.Fig. 5. Signed log odds ratios from a continent-stratified Cochran–Mantel–Haenszel test comparing the frequency of polyclonality in cattle versus human samples across a grid of allele count and stutter filter ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L_1$$\end{document} ) thresholds. This parameter sweep was conducted to evaluate the robustness of the observed host effect. The black box denotes the pre-specified filtering regime used in all inferential analyses. OR = odds ratio; IQR = interquartile range
In the multivariable logit model, the host effect remained significant after adjustment for continent, major-allele length, depth, and Gp60 family (adjusted OR for cattle vs. human \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\approx 2.69$$\end{document} ; \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$95\%$$\end{document} CI \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.21-5.95$$\end{document} ; \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p = 0.015$$\end{document} ). Sequencing depth ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p = 0.253$$\end{document} ), continent (North America vs Europe \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p = 0.266$$\end{document} ) were not associated with polyclonality in this model. Major-allele length showed a positive association (per-unit \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$OR = e^{0.1421} \approx 1.15, 95\%$$\end{document} CI \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.08-1.23$$\end{document} ; \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p < 0.001$$\end{document} ); interpreted over the interquartile range, this corresponds to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$OR_{IQR} \approx 2.43$$\end{document} . We consider plausible biological and technical explanations for this signal in the Discussion. Gp60 family ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p=0.729$$\end{document} ) was not associated with polyclonality, however, because only IIa and IId were observed across both continents (IIc was only detected in North American samples), the family contrast is sparse and should be interpreted cautiously.
Robustness to filtering choices was evaluated with a parameter sweep over minimum allele count ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1-30$$\end{document} ) and L1 threshold ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.01-0.30$$\end{document} ; L2 fixed at L1/2, min \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$AF = 0.05$$\end{document} ). Across a plateau surrounding the pre-specified thresholds (min count \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ge 3, L1 \ge 0.12$$\end{document} ), \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$OR_{MH}$$\end{document} consistently exceeded 1, with many settings reaching CMH \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p < 0.05$$\end{document} (Fig. 5). As expected, very lenient thresholds (min count \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$< 3$$\end{document} and/or \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L1 < 0.05$$\end{document} ) diluted the signal, consistent with increasing stutter noise rather than a change in the underlying association. Together, these analyses support a continent-adjusted excess of polyclonality in cattle relative to humans, and show that the conclusion is stable to reasonable choices of filtering parameters.
Allele distribution and co-occurrence
Within-sample allele associations were tested across 140 allele pairs using a continent-stratified CMH model (colour = signed \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-log10$$\end{document} p, bubble size = |log2(OR)|). Three pairs showed statistically significant positive associations (co-occurrence): A19G2R1-IIa with A18G2R1-IIa ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|log2(OR)|=6.96$$\end{document} ), A10G2-IIc with A16G2R1-IIa ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|log2(OR)|=4.80$$\end{document} ), and A16G2R1-IIa with A17G2R1-IIa ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|log2(OR)|=2.87$$\end{document} ). The latter reflects a moderate effect size. The cross-family association between A10G2-IIc and A16G2R1-IIa is likely a result of the low abundance of A10G2-IIc in combination with its tendency to present as a minor allele within this dataset. In contrast, several pairs exhibited significant negative associations (mutual exclusivity), notably A5G3-IIc with A15G1R1-IIa ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|log2(OR)|=-5.70$$\end{document} ), A19G1-IId with A20G1-IId ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|log2(OR)|=-8.13$$\end{document} ), and A15G1R1-IIa with A20G1-IId ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|log2(OR)|=-4.39$$\end{document} ). The highly prevalent A15G2R1-IIa showed many statistically significant exclusions with other alleles; however, most of these had small effect sizes (small bubbles indicating low |log2(OR)|).Fig. 6. Allele associations conditioned on continent using the Cochran-Mantel–Haenszel (CMH) test. Colour = signed \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-log10($$\end{document} one-sided p); red indicates positive association (co-occurrence), blue negative (mutual exclusivity). Bubble area is proportional to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mid log2(OR)\mid$$\end{document} where OR is the CMH pooled odds ratio. Grey bubbles mark non-significant pairs ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p \ge 0.05$$\end{document} )
Discussion
Cattle are more likely to exhibit multiple Gp60 alleles than humans
The stratified analysis revealed that cattle are substantially more likely than humans to harbour multiple Gp60 alleles (approximately 2.3-fold higher odds), which is likely to reflect a greater Gp60 subtype polyclonality rate. Inference is strengthened by the persistence of the host effect after adjustment for continent, sequencing depth, and Gp60 family in a multivariable model, indicating that the contrast is not readily explained by measured confounding. An association was observed between polyclonality and major-allele length, with longer alleles exhibiting increased odds of within-sample diversity. This effect was independent of host: neither a per-continent Mann–Whitney U test nor an OLS model adjusting for host, continent, and their interaction detected any meaningful difference in major-allele length between humans and cattle (Europe: \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p=0.60$$\end{document} , North America: \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p=0.95$$\end{document} , OLS \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p=0.48$$\end{document} , adjusted \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2=-0.001$$\end{document} ). This length effect is biologically plausible, as longer STRs are more prone to slipped-strand mispairing during replication. However, because the generation of PCR slippage artefacts share a similar mechanistic origin, distinguishing biological from technical sources of variation in longer alleles is non-trivial. Accordingly, the observed association should be interpreted with caution, and future studies leveraging high-fidelity amplification or long-read sequencing may help disentangle these effects.
Because estimates of multiplicity at STR loci are sensitive to artefactual alleles, a conservative, pre-specified filtering scheme was adopted to suppress PCR stutter. An L1 near 0.15 is widely used in forensic STR work [28, 29], although lower values are sometimes chosen [27, 30]. Importantly, those thresholds were developed for fragment sizing rather than NGS allele frequencies, so they should be treated as heuristic. To assess sensitivity, we performed an exploratory sweep over allele-count and L1 values. The cattle-human contrast remained directionally stable across a neighbourhood around the pre-specified settings, whereas very lenient filters attenuated the signal in a manner consistent with rising stutter noise rather than a biological reversal. We therefore base formal conclusions on the pre-specified threshold set and view the sweep as a robustness check rather than a multiple-comparison exercise.
The elevated polyclonality observed in cattle, combined with the association between major allele length and polyclonality, suggests two non-exclusive mechanisms underlying within-host diversity at the Gp60 locus:
- In-host diversification via slipped-strand mispairing, which generates novel alleles during replication. This mechanism is supported by the observed positive association between major-allele length and polyclonality, consistent with the known propensity of longer STR tracts to undergo slippage.
- Superinfection from multiple sources, resulting in the co-occurrence of genetically distinct Gp60 alleles within the same host. This is supported by the markedly higher polyclonality rates observed in cattle, which are likely driven by intense transmission pressure in agricultural settings, high environmental oocyst burdens, and repeated exposure via shared water or fomites.
Together, these findings point to both de novo generation of allelic variants and repeated exogenous introduction as contributors to Gp60 polyclonality. From a public-health perspective, frequent mixed infections in livestock may increase the risk of zoonotic spillover and facilitate genetic exchange between divergent lineages, underscoring the importance of targeted surveillance at the animal–human interface.
Gp60 subtyping interrogates a single hypervariable locus, and consequently mixed infections detected here may under-represent genome-wide diversity. Longitudinal sampling and whole-genome assays will be important to confirm the host contrast, quantify within-host dynamics, and determine whether polyclonality in cattle reflects repeated exposure, prolonged carriage, or enhanced within-host diversification.
Allelic associations, host plasticity, and geographic structure
Analysis of allele co-occurrence among polyclonal samples, conditioned on continent of origin, revealed both positive and negative associations (Fig. 6). Several positively associated pairs differed by a single repeat unit, consistent with recent divergence via replication slippage or co-transmission under high-exposure conditions. In contrast, multiple allele pairs exhibited statistically significant mutual exclusivity, including cross-family exclusions (e.g., A5G3-IIc with A15G1R1-IIa). These were detected within continents, suggesting that geographic segregation alone cannot explain the pattern. The most prevalent allele, A15G2R1-IIa, showed widespread exclusion of other alleles, likely reflecting a combination of high baseline frequency and genuine competitive incompatibility. These patterns could arise from within-host competition (e.g., differential fitness, immune evasion, or niche occupancy) or from biased introduction histories; distinguishing these mechanisms will require replication at additional loci, whole-genome metrics, and functional follow-up.
While artefactual contributions from PCR stutter cannot be fully excluded, filtering parameters and robustness checks across a broad threshold sweep support the qualitative validity of these findings. Nonetheless, because PCR stutter and slipped-strand mispairing can produce indistinguishable one-repeat variants at the Gp60 locus, complete removal of stutter-derived artefacts cannot be guaranteed. Consequently, caution is warranted when interpreting associations involving near-neighbour repeat variants due to the possibility of residual stutter artefacts.
Subtype distribution patterns were broadly consistent with established host associations and geographic trends. A15G2R1-IIa was the most frequently detected allele across the dataset, predominating in human and small ruminant samples, and aligning with its previously reported global ubiquity and non-host specificity [31–33]. In contrast, A19G1-IId dominated cattle samples from Asia, although this signal likely reflects sampling bias, as all Asian sequences originated from Chinese cattle. North American and European samples exhibited considerable overlap in subtype composition but also included region-specific alleles such as A15G2R2-IIa in North America and A17G1R1-IIa in Europe. A subset of alleles traditionally regarded as host-restricted displayed unexpected distributions. Notably, A19G2R1-IIa, previously described as cattle-associated [34], was recovered from 12 human samples in this dataset. These findings indicate that certain Gp60 subtypes may exhibit broader host plasticity than previously recognised, and underscore the value of continued genomic surveillance to track potential shifts in host adaptation.
MOI may act as a ‘genetic crucible’ for diversification
Across continents, A15G2R1-IIa was widespread in humans and small ruminants, whereas A19G1-IId dominated cattle samples taken in China. The detection of IIa/IIc co-infections in North American humans and IId co-existing with a C. tyzzeri IXa-like allele in a mouse highlights the potential for subtypes, subspecies, and even entirely different species with different host affinities to undergo recombination through sexual reproduction. We hypothesise that such mixed infections, a necessary prerequisite for recombination, may act as a genetic crucible, facilitating diversification through within-host recombination. Such recombinant populations of Cryptosporidium spp. are likely to present substantial challenges in the control of this parasite worldwide. These results warrant an investigation into the association between MOI and recombination on larger scale, multi-continent datasets, and using multiple markers or genomics to robustly test this hypothesis.
Technical limitations of BlooMine
The primary limitation of BlooMine is its reliance on the complete query (target sequence + flanking regions) being captured by enough reads to carry out MOI analysis. Larger query regions will require larger read sizes which are not produced by illumina sequencing platforms. However, the advent and employment of long-read technologies will facilitate the analysis of longer and more complex repeat rich regions, with which polyclonality may be characterised. Future work on BlooMine will support the analysis of formats commonly utilised by these long-read sequencing platforms.
Conclusion
Polyclonality in Cryptosporidium parvum shapes transmission, host adaptation, and diversification. Continent-stratified analyses indicate cattle have higher odds of multi-subtype Gp60 infections, consistent with the hypothesis that intensive agriculture elevates transmission and, in turn, polyclonality. Patterns of allele co-occurrence may suggest rapid STR-mediated microevolution, while mutual exclusivity between certain alleles points to possible competitive exclusion. Some subtypes exhibited unexpected host distributions, challenging assumptions about host restriction. These findings raise the hypothesis that elevated polyclonality may act as a genetic crucible, facilitating within-host recombination and subtype diversification. This hypothesis is consistent with the observed allele diversity and co-occurrence patterns we report here, and should be rigorously explored in future multi-locus or whole-genome studies. By enabling high-resolution, alignment-free detection of STR variation from NGS data, BlooMine provides a scalable tool for investigating within-host diversity and provides a framework for future MOI surveillance and evolutionary studies.
Supplementary Information
Supplementary Material 1
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Sow SO, Muhsen K, Nasrin D, Blackwelder WC, Wu Y, Farag TH, et al. The burden of cryptosporidium diarrheal disease among children 24 months of age in moderate/high mortality regions of sub-saharan africa and south asia, utilizing data from the global enteric multicenter study (gems). P Lo S Negl Trop Dis. 2016;10(5):e 0004729. 10.1371/journal.pntd.0004729.10.1371/journal.pntd.0004729 PMC 487881127219054 · doi ↗ · pubmed ↗
- 2Morris AV, Robinson G, Chalmers R, Cacciò S, Connor T. Parapipe: a pipeline for parasite next-generation sequencing data analysis applied to Cryptosporidium. Access Microbiol. 2025;7(8):000993.v 3. 10.1099/acmi.0.000993.v 3. PMID: 40988887; PMCID: PMC 12451305.10.1099/acmi.0.000993.v 3PMC 1245130540988887 · doi ↗ · pubmed ↗
- 3Swain MT, Vickers M. Interpreting alignment-free sequence comparison: what makes a score a good score? NAR Genomics Bioinforma. 2022 09;4(3):lqac 062. 10.1093/nargab/lqac 062.10.1093/nargab/lqac 062PMC 944250036071721 · doi ↗ · pubmed ↗
- 4Leclair B, Frégeau CJ, Bowen KL, Fourney RM. Systematic analysis of stutter percentages and allele peak height and peak area ratios at heterozygous str loci for forensic casework and database samples. J Forensic Sci. 2004;49(5):968–80.15461097 · pubmed ↗
- 5Lichtmannsperger K, Harl J, Freudenthaler K, Hinney B, Wittek T, Joachim A. Cryptosporidium parvum, Cryptosporidium ryanae, and Cryptosporidium bovis in samples from calves in Austria. Protozoology. 2020:4291–5. 10.1007/s 00436-020-06928-5/Published.10.1007/s 00436-020-06928-5PMC 770448633057813 · doi ↗ · pubmed ↗
