Aligning 415 519 proteins in less than two hours on PC
Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Adam Gudys

TL;DR
FAMSA is a highly optimized, parallelized algorithm that rapidly aligns large protein sequence datasets with superior accuracy and minimal resource usage, exemplified by aligning over 415,000 sequences in under two hours.
Contribution
The paper introduces FAMSA, a novel progressive alignment algorithm that significantly improves speed and accuracy for large protein datasets using innovative similarity measures and optimization techniques.
Findings
FAMSA outperforms Clustal Omega and MAFFT on large datasets.
It achieves high-quality alignments with lower time and memory requirements.
Successfully aligned 415,519 sequences in less than two hours on a standard PC.
Abstract
Rapid development of modern sequencing platforms enabled an unprecedented growth of protein family databases. The abundance of sets composed of hundreds of thousands of sequences is a great challenge for multiple sequence alignment algorithms. In the article we introduce FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilisation of longest common subsequence measure for determining pairwise similarities, a novel method of gap costs evaluation, and a new iterative refinement scheme. Importantly, its implementation is highly optimised and parallelised to make the most of modern computer platforms. Thanks to the above, quality indicators, namely sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms like Clustal Omega or MAFFT for datasets exceeding a few thousand of…
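To make the longest common subsequence (LCS) measure mentioned in the abstract concrete, here is a minimal Python sketch of how an LCS length between two protein sequences can be computed with the textbook dynamic-programming recurrence. It is an illustration only, not FAMSA's implementation, which the abstract describes as highly optimised and parallelised. The function names `lcs_length` and `lcs_similarity` are hypothetical, and the normalisation by the shorter sequence length is an assumption made for this example rather than the similarity formula used in the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, classic O(len(a) * len(b)) DP."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so the rolling row stays short
    prev = [0] * (len(b) + 1)  # DP row for the previous character of `a`
    for ch_a in a:
        curr = [0]  # column 0: empty prefix of `b`
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character in a or b
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalisation (not from the paper): LCS length / shorter length."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


if __name__ == "__main__":
    # Toy sequences, chosen only to exercise the functions.
    print(lcs_length("AGGTAB", "GXTXAYB"))
    print(lcs_similarity("MKVLA", "MKVLA"))
```

In a progressive aligner, a pairwise score of this kind would be computed for many sequence pairs and used to build the guide tree. The quadratic per-pair cost of the sketch above is one reason a production tool needs a much faster formulation.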
Taxonomy
Topics: Genomics and Phylogenetic Studies · Algorithms and Data Compression · Machine Learning in Bioinformatics
