Bridging the Gaps in Statistical Models of Protein Alignment
Dinithi Sumanaweera, Lloyd Allison, Arun S. Konagurthu

TL;DR
This paper develops a comprehensive statistical model for protein alignment evolution, infers optimal matrices from benchmark data, and introduces MMLSUM, a new best-performing time-dependent Markov matrix for protein sequence analysis.
Contribution
It constructs a complete time-parameterised statistical model for the evolution of aligned protein pairs and introduces MMLSUM, the matrix that performs best when matrices are compared by the Shannon information content of each benchmark.
Findings
MMLSUM outperforms existing matrices in benchmarks
All parameters can be inferred from benchmark datasets
A new optimal matrix is inferred for each of the six benchmarks
Abstract
This work demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed from a time-parameterised substitution matrix and a time-parameterised 3-state alignment machine. All parameters of such a model can be inferred from any benchmark data-set of aligned protein sequences. This allows us to examine nine well-known substitution matrices on six benchmarks curated using various structural alignment methods; any matrix that does not explicitly model a "time"-dependent Markov process is converted to a corresponding base-matrix that does. In addition, a new optimal matrix is inferred for each of the six benchmarks. Using Minimum Message Length (MML) inference, all 15 matrices are compared by measuring the Shannon information content of each benchmark. This has resulted in a new and clear overall best-performing time-dependent…
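The two ingredients named in the abstract, a time-parameterised substitution matrix and a 3-state alignment machine, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the 3-letter alphabet, the numbers in `BASE` and `TRANS`, and the helper names are hypothetical and are not taken from the paper or from MMLSUM. The sketch also keeps the state machine fixed rather than time-dependent, and it does not reproduce the paper's MML inference. It only shows how a base matrix raised to the power t gives the matrix at time t, and how the Shannon information content (in bits) of matched pairs and of an alignment state string could be measured under such a model.

```python
import numpy as np

# Hypothetical toy alphabet and base matrix (illustrative values only).
ALPHABET = "ABC"
BASE = np.array([
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.02, 0.97],
])  # row-stochastic: BASE[i, j] = Pr(symbol j | symbol i) after one time unit

# Hypothetical fixed transition probabilities of a 3-state machine over
# match (m), insert (i) and delete (d); in the paper these depend on time.
TRANS = {
    "m": {"m": 0.90, "i": 0.05, "d": 0.05},
    "i": {"m": 0.60, "i": 0.35, "d": 0.05},
    "d": {"m": 0.60, "i": 0.05, "d": 0.35},
}


def matrix_at_time(base, t):
    """Substitution matrix after t steps of the Markov process, base ** t."""
    return np.linalg.matrix_power(base, t)


def stationary(base):
    """Stationary distribution: left eigenvector of base for eigenvalue 1."""
    vals, vecs = np.linalg.eig(base.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()


def match_bits(pairs, base, t):
    """Bits needed to state the matched pairs (a, b) under the matrix at time t."""
    m = matrix_at_time(base, t)
    pi = stationary(base)
    idx = {c: k for k, c in enumerate(ALPHABET)}
    return float(sum(-np.log2(pi[idx[a]] * m[idx[a], idx[b]]) for a, b in pairs))


def state_bits(states, trans, start="m"):
    """Bits needed to state an alignment state string such as 'mmidm'.

    Assumes, for simplicity, that the machine starts in the match state.
    """
    bits, prev = 0.0, start
    for s in states:
        bits += -np.log2(trans[prev][s])
        prev = s
    return float(bits)


def best_time(pairs, base, max_t=200):
    """Time value in 1..max_t that minimises the cost of the matched pairs."""
    return min(range(1, max_t + 1), key=lambda t: match_bits(pairs, base, t))


pairs = [("A", "A"), ("B", "C"), ("C", "C"), ("A", "B")]
t_hat = best_time(pairs, BASE)
total = match_bits(pairs, BASE, t_hat) + state_bits("mmmm", TRANS)
print(f"best t = {t_hat}, total message length = {total:.2f} bits")
```

A shorter total message length means the model explains the aligned pairs better, which is the sense in which matrices are compared on a benchmark.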
