hmmibd-rs: An enhanced hmmIBD implementation for parallelizable identity-by-descent detection from large-scale Plasmodium genomic data

Bing Guo; Stephen F. Schaffner; Aimee R. Taylor; Timothy D. O’Connor; Shannon Takala-Harrison

PMC · DOI:10.21203/rs.3.rs-7004070/v1·July 2, 2025

hmmibd-rs: An enhanced hmmIBD implementation for parallelizable identity-by-descent detection from large-scale Plasmodium genomic data

Bing Guo, Stephen F. Schaffner, Aimee R. Taylor, Timothy D. O’Connor, Shannon Takala-Harrison

PDF

Open Access

TL;DR

The paper introduces hmmibd-rs, a faster and more accurate version of hmmIBD for detecting genetic similarities in large Plasmodium genomic datasets.

Contribution

An enhanced, parallelized implementation of hmmIBD in Rust with support for non-uniform recombination rates.

Findings

01

hmmibd-rs reduces IBD detection time by ~100x using multi-threading on large Plasmodium datasets.

02

Incorporating non-uniform recombination rates improves IBD breakpoint accuracy and reduces false positives/negatives.

03

The tool enables single-day IBD analysis of large malaria genomic datasets like MalariaGEN Pf7.

Abstract

Identity-by-descent (IBD), which describes recent genetic co-ancestry between pairs of genomes, is a fundamental concept in population genomics. It has been used to estimate genetic relatedness, detect selection signals, and understand population demography. The IBD detection method hmmIBD demonstrates high accuracy in inferring IBD segments between haploid genomes, including Plasmodium falciparum, and is widely used in malaria genomic surveillance. However, the current single-threaded implementation of hmmIBD does not utilize the full capacity of multi-processor computers, making it difficult to apply to large data sets, and does not accommodate non-uniform recombination rates across the genome. We developed an enhanced implementation of hmmIBD in the Rust programming language, named hmmibd-rs, which leverages multi-threaded computing to parallelize IBD inference over genome pairs and…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Plasmodium falciparum

Diseases1

malaria

Figures3

Click any figure to enlarge with its caption.

a](#F1)), due to the more compact memory representation of the genotype matrix and reduced disk read and write operations in *hmmibd-rs*. When multithreading is enabled, the performance of *hmmibd-rs* is almost linear with respect to the number of threads and the number of genome pairs. To test the performance on a large data set, we ran *hmmIBD* and *hmmibd-rs* on simulated *P. falciparum*-like genomic data with a sample size of up to n = 30,000, which is the same order of magnitude as the MalariaGEN Pf7 data set (n > 21,000). Our new implementation completed the IBD detection from the simula

a](#F2) and [b](#F2)). When focusing on IBD segments ≥ 2 cM and using the average recombination rate (*hmmIBD*), we found that the number of ends (breakpoints) of the detected IBD segments decreases for recombination hot spots and increases for cold spots when compared to those using the true non-uniform rates (*hmmibd-rs*) ([Fig. 2c](#F2)). Consistently, the error rates (false negative rates and false positive rates, [Fig. 2e](#F2) and [f](#F2)) and deviation from the true IBD coverage pattern ([Fig. 2g](#F2)) were significantly higher in *hmmIBD* results than in *hmmibd-rs*. The differences

a](#F3)), with the majority of time spent on *bcftools* for the initial genotype filtering step. As a proof of data show signals of positive selection ([Fig. 3b](#F3)) consistent with previous reports [[11](#R11), [14](#R14)].

Keywords

Identity-By-DescentHidden Markov ModelParallelizationPopulation GenomicsRecombination Rate MapPlasmodium

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenomics and Phylogenetic Studies · Genetic diversity and population structure · Evolution and Genetic Dynamics

Full text

BACKGROUND

Identity-by-descent (IBD) refers to alleles or genomic regions (segments) that are identical between two individuals/genomes due to shared ancestry. For species with high recombination rates relative to mutation rates, such as malaria parasites, metrics that leverage recombination, e.g., IBD, capture finer-scale and higher-resolution dynamics in population demography [1–3]. IBD, inferred from malaria parasite genomic data, conveys important information about the recent history of populations, including genetic relatedness within and between populations, loci under natural selection, and time-specific demography (effective population size and population structure), thus playing a crucial role in malaria genomic surveillance [4–14].

Accurate detection of IBD segments often requires genotype data with sufficient marker density [4, 15, 16]. Species with a high ratio of recombination rate to mutation rate, such as the malaria parasite Plasmodium falciparum, tend to have a (common variant) marker density two orders of magnitude lower than that of humans [9, 15, 17–20]. Thus, an IBD detection algorithm robust to low marker density is crucial for high-recombining species. Our accompanying project suggests that hmmIBD stands out among many other IBD detection methods, uniquely providing high-quality IBD segment calls, including shorter segments, that allow for the generation of accurate results even for quality-sensitive inferences [16]. Despite its high accuracy and wide adoption in malaria research, hmmIBD could be improved in several areas: it uses only a single thread of multi-core CPUs, it assumes a uniform recombination rate across the genome, and it requires substantial preprocessing of input data into a specific format prior to analysis [5]. These limitations may prevent its wider application to larger data sets or to analyses that rely on a non-uniform genetic map [20, 21], an important consideration given the demand for analysis of large-scale whole-genome sequence (WGS) data [20] and opportunities to construct high-resolution recombination rate maps based on recent genetic crosses [22, 23].

In this work, we addressed these limitations by reimplementing and enhancing the Hidden Markov Model described in the original paper [5] using the Rust programming language. Our new implementation, hmmibd-rs, offers three key features: parallelized IBD inference, support for non-uniform recombination rates, and streamlined data management.

METHODS

We enabled parallelization for the HMM inference process at the level of single haploid genome pairs or groups of haploid genome pairs. In our reimplementation, we first modularized the original algorithm into multiple components, including data processing modules and different subcomponents of the HMM inference process. Relying on the modular structure, we then isolated the HMM inference process for a genome pair as the basic unit for parallelization. The original sequential HMM inference process, iterated over genome pairs, was converted into parallelizable tasks, utilizing the Rayon crate, a library designed for data parallelism [24]. In addition, we provide options to optimize memory usage based on parameters such as the maximum number of alternative alleles per locus and the output file buffer sizes.

We enabled non-uniform recombination rates for HMM inferences and IBD segment filtration using user-provided genetic maps. For HMM inference, we updated the term $[eqn]$ in the transition probabilities matrix [5] to the term $[eqn]$ , where ρ is the recombination rate per generation per bp, and dt is the physical distance between the t th marker and the t−1 th marker, and ct is the genetic distance between the two markers given by the user. As ct = ρ dt, our new implementation maps physical distances between markers to genetic distance, and directly uses genetic distance ct for HMM inference, thus removing the assumption of a uniform recombination rate along the genome. Besides the HMM inference steps, we also allow the usage of a recombination rate map for post-HMM inference, built-in IBD segment length filtration in genetic units, which is usually needed for analyses based on IBD segments since short IBD segment estimates are more error-prone and often filtered out before downstream analyses [15, 16, 25].

We improved the efficiency and ergonomics of input and output data management. We developed a simple, cross-platform auxiliary library bcf_reader in Rust and used it to process input data directly from the common genotype file format, binary variant call format (BCF). Based on this library, we implemented two main built-in functions. The first main function is to construct haploid genomes by replacing heteroallelic genotype calls in monoclonal samples with dominant alleles if the total allele depths (via the command line option min_depth) and the fraction of reads supporting the dominant alleles (via the options min_ratio and min_r1_r2) are high; otherwise, these are set to missing (as detailed in Supplementary Table 1). We note that the default criteria for determining whether to use dominant alleles or missing data are somewhat subjective: users may opt for stringent thresholds, such as setting all heteroallelic calls to missing data, which comes with the caveat of removing more sites and samples during the subsequent genotype filtering step, or more permissive ones to include all heteroallelic calls by using the allele with the highest read support, which may introduce substantial genotyping errors. The second built-in function is to iteratively filter samples and sites (based on the missingness of genotype calls) to obtain high-quality genotype data while retaining balanced numbers of markers and samples (Supplementary Table 1). The dominant-allele-based haploid genome construction from BCF files is a heuristic strategy for working with monoclonal samples. Haploid genomes from polyclonal infections may be inferred via external deconvolution programs like DEPloid or DEPloidIBD [26, 27] and provided to hmmibd-rs in a traditional table format used in hmmIBD.

We have included additional options, made available via the command line interface, to customize HMM parameters and data management parameters (e.g.,BCF processing, IBD filtering, output buffering, and optional suppressing) to allow the user to balance analytical needs and computational/storage efficiency. Additionally, hmmibd-rs is designed to be fully compatible with hmmIBD to facilitate the transition to hmmibd-rs, and, by default, generates both files for IBD segments and files for the fraction of sites IBD (estimates of genetic relatedness) like hmmIBD.

Methods used for simulation, measurement of computation time, and downstream analysis of detected IBD segments are similar to our previous analysis [14, 16], with further description provided in the Supplementary Methods. Details of these analyses can be found in the related pipeline and source code listed in the Availability of Data and Materials.

RESULTS

Our new implementation, hmmibd-rs, improves the computational efficiency of HMM IBD inference both by increasing single-thread performance and by enabling multithreading. When both hmmibd-rs and hmmIBD are forced to use a single thread, run times with hmmibd-rs were about 40% shorter than those of hmmIBD (Fig. 1a), due to the more compact memory representation of the genotype matrix and reduced disk read and write operations in hmmibd-rs. When multithreading is enabled, the performance of hmmibd-rs is almost linear with respect to the number of threads and the number of genome pairs. To test the performance on a large data set, we ran hmmIBD and hmmibd-rs on simulated P. falciparum-like genomic data with a sample size of up to n = 30,000, which is the same order of magnitude as the MalariaGEN Pf7 data set (n > 21,000). Our new implementation completed the IBD detection from the simulated data set in 1.3 hours using 128 threads with the AMD EPYC 9654 CPU model, whereas the single-threaded hmmIBD took an estimated 5.2 days to complete IBD detection (Fig. 1c). Additionally, when IBD segment length filtering options were used, the resulting file sizes were largely reduced (Supplementary Table 2). Thus, this new implementation, in this example, can accelerate the process by two orders of magnitude.

To understand how recombination rate misspecification affects IBD detection, we simulated genomes with a non-uniform recombination rate map that had the same mean recombination rate as the P. falciparum genome (Fig. 2a and b). When focusing on IBD segments ≥ 2 cM and using the average recombination rate (hmmIBD), we found that the number of ends (breakpoints) of the detected IBD segments decreases for recombination hot spots and increases for cold spots when compared to those using the true non-uniform rates (hmmibd-rs) (Fig. 2c). Consistently, the error rates (false negative rates and false positive rates, Fig. 2e and f) and deviation from the true IBD coverage pattern (Fig. 2g) were significantly higher in hmmIBD results than in hmmibd-rs. The differences between hmmIBD- and hmmibd-rs-derived IBD segments are largely reduced when using the true rates to calculate length used to filter segments inferred by hmmIBD, suggesting that an accurate recombination map is important for filtering IBD segments by length in genetic units (Supplementary Fig. 1). To test whether the recombination rate affects HMM inference, we analyzed unfiltered IBD segments when called with true (hmmibd-rs) and average rates (hmmIBD). We showed that rate misspecification indeed affects the detection of IBD breakpoints (Supplementary Fig. 2). The general underestimation of IBD breakpoints in recombination hotspots likely arises from two main factors: the IBD-merging bias in the HMM and the high error rates caused by low marker densities per genetic unit. In addition, this issue is aggravated by reduced state switching rate in the HMM and aggressive IBD removal in length filtering due to recombination rate misspecification (see Supplementary Note for more details). This finding highlights the importance of accurately characterizing the local recombination rate variation and using it to improve the detection and filtration of IBD segments.

Another obstacle in the hmmIBD-based analytical pipeline is the need for an hmmIBD-specific format for the input data, which adds a data formatting step to IBD detection and downstream analysis, which is particularly cumbersome if iterative filtering of samples and variants is done. We mitigated these issues by implementing optional, built-in, all-in-memory functions for iterative sample and site filtering, and for haploid genome construction using dominant alleles (Supplementary Table 1). We presented a simple pipeline based on these features, implemented in hmmibd-rs, to demonstrate its applicability to large-scale WGS data sets, from VCF/BCF files to IBD-based estimates, which includes genotype filtering and haploid genome construction (bcftools [28] and hmmibd-rs), IBD calling (hmmibd-rs), and IBD coverage calculation. We were able to finish IBD calling from the raw genotype call files within a single day using 64 threads CPU (Fig. 3a), with the majority of time spent on bcftools for the initial genotype filtering step. As a proof of data show signals of positive selection (Fig. 3b) consistent with previous reports [11, 14].

DISCUSSION

This study presents an improved implementation of hmmIBD with three important features for large-scale population genomics: high computational performance, optional recombination rate map specification, and improved data management. Compared to other probabilistic IBD (segment) detection methods popular in malaria research, including hmmIBD [5], isoRelate [6] and DEploidIBD [27], hmmibd-rs is the first attempt to leverage the memory-safe language Rust and its rich ecosystem to embrace the era of large-scale genomics by enabling computational parallelization, employing a standard input format and lowering difficulty of long-term software maintainability and further development to incorporate more complex models. Although hmmibd-rs has mainly been applied to Plasmodium data, it is expected to work with data from other sexually recombining species with high recombination rates for which haploid genomes can be constructed [29], which may include non-Plasmodium Apicomplexan species, such as Theileria parva [30], and insects such as Apis mellifera [31, 32] and fungi such as Saccharomyces cerevisiae [33, 34]. The new features of hmmibd-rs including its parallelizability and support of a non-uniform genetic map may allow the detection of inter-individual IBD segments from phased data on diploids, e.g. mosquitoes, by treating each phase of a diploid individual as a haploid genome. However, given advances in human genetics, a superior approach for diploids may also exist.

One caveat of our analyses of hmmibd-rs is the lack of reliable non-uniform recombination rates for empirical data sets of high-recombining species like P. falciparum, despite initial efforts to estimate either the average rate or high-resolution rate maps based on limited samples [22, 23]. Ongoing work is needed to estimate high-resolution recombination rate maps based on existing genetic cross data, as well as WGS data from large-scale population samples [21, 22, 35–37]. This will allow further evaluation of biases in IBD-based analysis due to the use of a simple average rate in empirical data. We also note that the optional built-in function that constructs haploid genomes based on dominant alleles is misspecified for polyclonal samples. Using more advanced genotype deconvolution tools, such as DEploid and DEploidIBD [26, 27], may better utilize polyclonal infections, although it may significantly increase the computational burden.

CONCLUSION

hmmibd-rs enhances the original IBD detection algorithm hmmIBD with key features that significantly accelerate IBD detection from large-scale genomic data and enable the incorporation of a genetic map for improved accuracy in genomes with non-uniform recombination. The new implementation allows for more efficient, accurate, and streamlined IBD-based analysis of Plasmodium genomes, which will contribute to the timely malaria genomic surveillance.

Supplementary Material

Supplementary Files

This is a list of supplementary files associated with this preprint. Click to download.

hmmibdrsmanuscriptsuppl.docx

Bibliography37

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Neafsey DE, Taylor AR, Mac Innis BL. Advances and opportunities in malaria population genomics. Nat Rev Genet. 2021;22:502–17.33833443 10.1038/s 41576-021-00349-5PMC 8028584 · doi ↗ · pubmed ↗
2Camponovo F, Buckee CO, Taylor AR. Measurably recombining malaria parasites. Trends Parasitol. 2023;39:17–25.36435688 10.1016/j.pt.2022.11.002PMC 9893849 · doi ↗ · pubmed ↗
3Guo B, Rowley E, O’Connor TD, Takala-Harrison S. Potential and pitfalls of using identity-by-descent for malaria genomic surveillance. Trends Parasitol. 2025;41:387–400.40263027 10.1016/j.pt.2025.03.012PMC 12070291 · doi ↗ · pubmed ↗
4Taylor AR, Jacob PE, Neafsey DE, Buckee CO. Estimating relatedness between malaria parasites. Genetics. 2019;212:1337–51.31209105 10.1534/genetics.119.302120 PMC 6707449 · doi ↗ · pubmed ↗
5Schaffner SF, Taylor AR, Wong W, Wirth DF, Neafsey DE. Hmm IBD: Software to infer pairwise identity by descent between haploid genotypes. Malar J. 2018;17:10–3.29764422 10.1186/s 12936-018-2349-7PMC 5952413 · doi ↗ · pubmed ↗
6Henden L, Lee S, Mueller I, Barry A, Bahlo M. Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens. P Lo S Genet. 2018;14:e 1007279.29791438 10.1371/journal.pgen.1007279 PMC 5988311 · doi ↗ · pubmed ↗
7Browning SR, Browning BL. Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. Am J Hum Genet. 2015;97:404–18.26299365 10.1016/j.ajhg.2015.07.012PMC 4564943 · doi ↗ · pubmed ↗
8Morgan AP, Brazeau NF, Ngasala B, Mhamilawa LE, Denton M, Msellem M, Falciparum malaria from coastal Tanzania and Zanzibar remains highly connected despite effective control efforts on the archipelago. Malar J. 2020;19:47.31992305 10.1186/s 12936-020-3137-8PMC 6988337 · doi ↗ · pubmed ↗