Score distributions of gapped multiple sequence alignments down to the low-probability tail
Pascal Fieth, Alexander K. Hartmann

TL;DR
This paper investigates the score distributions of gapped multiple sequence alignments of random sequences down to the low-probability tail. It reports deviations from the Gumbel distribution and refines the statistical description used to assess the significance of alignment scores.
Contribution
It extends previous rare-event studies from pairwise to multiple sequence alignments, showing that the score distributions of multiple alignments differ from those of pairwise alignments and require refined statistical models.
Findings
Score distributions for multiple alignments differ from pairwise cases.
Deviations from the Gumbel distribution are observed for finite sequence lengths.
Refined Gaussian corrections improve the modeling of score distributions.
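For orientation, the Gumbel form referred to in the findings can be written out as below. This is the standard extreme-value density, not a formula quoted from the paper; the symbols $\lambda$ (decay rate), $S_0$ (location of the peak) and $\lambda_2$ (strength of the correction) are notation assumed here. The second line is only one plausible reading of a "Gaussian correction", namely a multiplicative Gaussian factor, and should be treated as an assumption because this summary does not state the functional form.

```latex
% Gumbel density for the optimal alignment score S (standard form)
P_{\mathrm{G}}(S) = \lambda \exp\!\left[-\lambda (S - S_0) - e^{-\lambda (S - S_0)}\right]

% Assumed form of a Gaussian correction (not confirmed by this summary)
P(S) = P_{\mathrm{G}}(S)\, \exp\!\left[-\lambda_2 (S - S_0)^2\right]
```

In the tail $S \gg S_0$ the Gumbel density reduces to $\lambda e^{-\lambda (S - S_0)}$, so $\log P_{\mathrm{G}}(S)$ is linear in $S$. Under the assumed correction, a quadratic term in $\log P(S)$ would show up as a downward bend of that tail.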
Abstract
Assessing the significance of alignment scores of optimally aligned DNA or amino acid sequences can be achieved via the knowledge of the score distribution of random sequences. But this requires obtaining the distribution in the biologically relevant high-scoring region, where the probabilities are exponentially small. For gapless local alignments of infinitely long sequences this distribution is known analytically to follow a Gumbel distribution. Distributions for gapped local alignments and global alignments of finite lengths can only be obtained numerically. To obtain results for the small-probability region, specific statistical mechanics-based rare-event algorithms can be applied. In previous studies, this was achieved for pairwise alignments. They showed that, contrary to results from previous simple sampling studies, strong deviations from the Gumbel distribution occur in case of…
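The "simple sampling" approach mentioned in the abstract can be illustrated with a minimal sketch: draw random sequence pairs, compute each optimal gapped local alignment score, and histogram the results. This is not the authors' code and not their rare-event algorithm. The scoring parameters (`match`, `mismatch`, `gap`), the linear gap cost, the uniform four-letter alphabet, and the function names are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def local_score(a, b, match=1, mismatch=-3, gap=-4):
    """Optimal local alignment score (Smith-Waterman) with a linear gap cost.

    The scoring parameters are illustrative assumptions, not values
    taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for x in a:
        cur = [0]  # first column is always 0 for local alignment
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def sample_score_histogram(length, n_samples, alphabet="ACGT", seed=0):
    """Estimate P(S) by simple sampling of uniformly random sequence pairs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        a = rng.choices(alphabet, k=length)
        b = rng.choices(alphabet, k=length)
        counts[local_score(a, b)] += 1
    # Normalise counts to relative frequencies, ordered by score.
    return {s: c / n_samples for s, c in sorted(counts.items())}


if __name__ == "__main__":
    hist = sample_score_histogram(length=20, n_samples=2000)
    for score, prob in hist.items():
        print(score, prob)
```

With `n_samples` draws, the smallest nonzero probability such a histogram can report is `1 / n_samples`. That limit is why the exponentially small probabilities of the high-scoring region are out of reach for simple sampling and call for the rare-event algorithms the abstract refers to.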
