A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data
Kyung-Ah Sohn, Eric P. Xing

TL;DR
This paper introduces a hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population genomic data. The model leaves the number of clusters open-ended and allows ancestors to be shared by clusters from different populations.
Contribution
The paper presents a novel nonparametric Bayesian model and a new haplotype inference program, Haploi, which uses multi-population data to improve accuracy and speed.
Findings
Haploi outperforms existing methods in speed and accuracy.
The hierarchical Dirichlet process effectively models shared population structures.
The model handles large, heterogeneous genomic datasets efficiently.
Abstract
The perennial problem of "how many clusters?" remains an issue of substantial interest in data mining and machine learning communities, and becomes particularly salient in large data sets such as populational genomic data where the number of clusters needs to be relatively large and open-ended. This problem gets further complicated in a co-clustering scenario in which one needs to solve multiple clustering problems simultaneously because of the presence of common centroids (e.g., ancestors) shared by clusters (e.g., possible descents from a certain ancestor) from different multiple-cluster samples (e.g., different human subpopulations). In this paper we present a hierarchical nonparametric Bayesian model to address this problem in the context of multi-population haplotype inference. Uncovering the haplotypes of single nucleotide polymorphisms is essential for many biological and medical…
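The co-clustering idea in the abstract (clusters in different populations sharing common ancestors, with no fixed cluster count) can be illustrated with the Chinese restaurant franchise, a standard construction of the hierarchical Dirichlet process prior. The sketch below is only an illustration under stated assumptions: it samples cluster labels from the prior and contains no haplotype or genotype model, it is not the Haploi implementation, and the function name `sample_crf` and the parameters `alpha` and `gamma` are hypothetical names chosen here.

```python
import random


def sample_crf(group_sizes, alpha=1.0, gamma=1.0, seed=0):
    """Draw cluster labels from a Chinese restaurant franchise prior.

    group_sizes: number of items (e.g. haplotypes) in each group (e.g. population).
    alpha: within-group concentration (assumed name).
    gamma: top-level concentration (assumed name).
    Returns one list of global cluster ids per group.
    """
    rng = random.Random(seed)
    dish_counts = []  # tables (over all groups) assigned to each global cluster
    labels = []
    for n in group_sizes:
        table_sizes = []  # items seated at each table of this group
        table_dish = []   # global cluster id attached to each table
        group_labels = []
        for _ in range(n):
            # Join an existing table with weight equal to its size,
            # or open a new table with weight alpha.
            weights = table_sizes + [alpha]
            t = rng.choices(range(len(weights)), weights=weights)[0]
            if t == len(table_sizes):
                # A new table picks a global cluster with weight equal to the
                # number of tables already using it, or a brand-new cluster
                # with weight gamma. This step is what shares clusters
                # across groups.
                dish_weights = dish_counts + [gamma]
                k = rng.choices(range(len(dish_weights)), weights=dish_weights)[0]
                if k == len(dish_counts):
                    dish_counts.append(0)
                dish_counts[k] += 1
                table_sizes.append(0)
                table_dish.append(k)
            table_sizes[t] += 1
            group_labels.append(table_dish[t])
        labels.append(group_labels)
    return labels


if __name__ == "__main__":
    labels = sample_crf([50, 50, 50], alpha=1.0, gamma=1.0, seed=1)
    for g, group_labels in enumerate(labels):
        print(f"group {g}: clusters used = {sorted(set(group_labels))}")
```

In this sketch the number of global clusters is not fixed in advance: it grows as items are added, and the same cluster id can appear in several groups, which mirrors the shared-ancestor structure the abstract describes.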
