Ultra-large alignments using Phylogeny-aware Profiles
Nam-phuong Nguyen, Siavash Mirarab, Keerthana Kumar, Tandy Warnow

TL;DR
UPP is a multiple sequence alignment method, built on an ensemble of Hidden Markov Models, that produces highly accurate alignments for ultra-large and fragmentary datasets. Such alignments underpin analyses like the estimation of deep evolutionary histories and the detection of remote homology.
Contribution
The paper introduces UPP, a novel alignment method using an ensemble of Hidden Markov Models for ultra-large and fragmentary sequence datasets.
Findings
Produces highly accurate alignments, even on ultra-large datasets
Remains accurate when the dataset contains fragmentary sequences
Works for both nucleotide and amino acid sequences
Abstract
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments (MSAs) and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, an MSA method that uses a new machine learning technique, the Ensemble of Hidden Markov Models, that we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp.
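The abstract names the Ensemble of Hidden Markov Models but does not describe how it is used. The following is a minimal sketch of the general idea, under two assumptions that are not stated above: each model in the ensemble is built from a subset of already-aligned sequences, and each remaining sequence is aligned using whichever model scores it best. Simple per-column frequency profiles stand in for profile HMMs (no insert or delete states), and every function name and the toy data are hypothetical. This is an illustration, not the UPP implementation.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: nucleotide input with no ambiguity codes


def build_profile(aligned_seqs, pseudocount=1.0):
    """Return per-column log-probabilities for each nucleotide (gaps ignored)."""
    length = len(aligned_seqs[0])
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append(
            {a: math.log((counts[a] + pseudocount) / total) for a in ALPHABET}
        )
    return profile


def best_placement(profile, query):
    """Slide the ungapped query along the profile; return (score, offset)."""
    best = (-math.inf, 0)
    for offset in range(len(profile) - len(query) + 1):
        score = sum(profile[offset + i][c] for i, c in enumerate(query))
        if score > best[0]:
            best = (score, offset)
    return best


def align_with_ensemble(subsets, query):
    """Score the query against every subset profile and keep the best fit."""
    results = []
    for name, seqs in subsets.items():
        score, offset = best_placement(build_profile(seqs), query)
        results.append((score, name, offset))
    _, name, offset = max(results)
    width = len(next(iter(subsets.values()))[0])
    # Pad with gaps so the query occupies one row of the existing alignment.
    row = "-" * offset + query + "-" * (width - offset - len(query))
    return name, offset, row


# Toy backbone alignment, already split into two subsets (hypothetical data).
subsets = {
    "clade_A": ["ACGTACGT", "ACGTACGA", "ACGTTCGT"],
    "clade_B": ["TTGACCAA", "TTGACCAT", "TTGTCCAA"],
}
name, offset, row = align_with_ensemble(subsets, "GACC")
```

Here the fragment "GACC" scores best against the profile built from clade_B, so it is placed at offset 2 of that subset's columns. A single profile built from all six sequences would blur the two clades' column frequencies, which is one intuition for why several narrower models can suit short fragments better than one broad model.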
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Algorithms and Data Compression
