Maximum Likelihood Estimation of Frequencies of Known Haplotypes from Pooled Sequence Data
Darren Kessner, Tom Turner, and John Novembre

TL;DR
This paper introduces an EM algorithm for estimating the frequencies of known haplotypes directly from pooled sequencing data. It is more accurate than existing methods based on single-site allele frequencies, and the authors provide an open-source implementation.
Contribution
The paper presents a novel EM-based method for haplotype frequency estimation from pooled sequence data, applicable to microbiome and population studies.
Findings
Outperforms existing methods based on single-site allele frequencies
Effective for microbiome and population sequencing
Implemented as open-source software
Abstract
DNA samples are often pooled, either by experimental design, or because the sample itself is a mixture. For example, when population allele frequencies are of primary interest, individual samples may be pooled together to lower the cost of sequencing. Alternatively, the sample itself may be a mixture of multiple species or strains (e.g. bacterial species comprising a microbiome, or pathogen strains in a blood sample). We present an expectation-maximization (EM) algorithm for estimating haplotype frequencies in a pooled sample directly from mapped sequence reads, in the case where the possible haplotypes are known. This method is relevant to the analysis of pooled sequencing data from selection experiments, as well as the calculation of proportions of different strains within a metagenomics sample. Our method outperforms existing methods based on single-site allele frequencies, as well…
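To make the idea in the abstract concrete, here is a minimal sketch of an EM iteration for mixture proportions over known haplotypes. It is an illustration under stated assumptions, not the authors' implementation: haplotypes are encoded as 0/1 alleles at biallelic sites, each read is a hypothetical `(start, alleles)` pair covering consecutive sites, base errors are independent with a single assumed `error_rate`, and read start positions are taken to be independent of the source haplotype. The function names `read_likelihoods` and `em_haplotype_frequencies` are invented for this example.

```python
import numpy as np

def read_likelihoods(reads, haplotypes, error_rate=0.01):
    """Return an (n_reads, n_haplotypes) matrix of P(read | haplotype).

    Assumed simplified error model: each observed allele matches the
    haplotype with probability 1 - error_rate, independently per site.
    """
    haplotypes = np.asarray(haplotypes)
    lik = np.empty((len(reads), haplotypes.shape[0]))
    for i, (start, alleles) in enumerate(reads):
        alleles = np.asarray(alleles)
        # Slice of every haplotype covered by this read.
        window = haplotypes[:, start:start + len(alleles)]
        mismatches = (window != alleles).sum(axis=1)
        matches = len(alleles) - mismatches
        lik[i] = (1 - error_rate) ** matches * error_rate ** mismatches
    return lik

def em_haplotype_frequencies(lik, n_iter=200, tol=1e-8):
    """Estimate haplotype frequencies from a read-likelihood matrix by EM."""
    n_haplotypes = lik.shape[1]
    freqs = np.full(n_haplotypes, 1.0 / n_haplotypes)  # uniform start
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each haplotype.
        weighted = lik * freqs
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new frequency is the average posterior over reads.
        new_freqs = resp.mean(axis=0)
        converged = np.abs(new_freqs - freqs).max() < tol
        freqs = new_freqs
        if converged:
            break
    return freqs

# Toy example: two known haplotypes over four sites, four short reads.
haplotypes = [[0, 0, 0, 0], [1, 1, 1, 1]]
reads = [(0, [0, 0]), (1, [0, 0]), (2, [0, 0]), (2, [1, 1])]
estimated = em_haplotype_frequencies(read_likelihoods(reads, haplotypes))
```

In this toy input three of the four reads match the first haplotype, so the estimate should land close to 0.75 and 0.25. Because each read contributes a likelihood across all the sites it covers, the estimate uses linkage information that a per-site allele frequency would discard.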
