Impact of phylogeny on the inference of functional sectors from protein sequence data
Nicola Dietler, Alia Abbara, Subham Choudhury, Anne-Florence Bitbol

TL;DR
This study examines how phylogeny affects the detection of functional sectors in proteins from sequence data. It shows that some methods, notably ICOD, are robust to phylogenetic effects, and that combining ICOD with conservation analysis improves the identification of functional sites.
Contribution
The paper introduces a controlled synthetic data approach to dissect phylogeny's impact on sector detection and evaluates the robustness of ICOD and conservation methods on real protein data.
Findings
ICOD is the most robust to phylogenetic effects of the methods evaluated.
Conservation analysis is also quite robust.
Combining ICOD and conservation reveals complementary functional information.
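The two kinds of scores named in these findings can be illustrated on toy data. The sketch below is not the authors' implementation: it assumes a binary (0/1) alignment, scores conservation as the relative entropy of each column against a uniform background, and computes an "ICOD-like" score by inverting a ridge-regularized covariance matrix, zeroing its diagonal, and taking the eigenvector with the largest-magnitude eigenvalue. The function names, the ridge term, the background frequency, and the toy alignment are all hypothetical choices made for illustration.

```python
import numpy as np


def site_conservation(msa, background=0.5, eps=1e-3):
    """Per-site relative entropy of a binary alignment vs. a background frequency.

    msa: array of shape (n_sequences, n_sites) with entries 0 or 1.
    The background frequency and the clipping constant are assumptions.
    """
    f = np.clip(msa.mean(axis=0), eps, 1.0 - eps)
    q = background
    return f * np.log(f / q) + (1.0 - f) * np.log((1.0 - f) / (1.0 - q))


def icod_like_scores(msa, ridge=1e-2):
    """Illustrative ICOD-style site scores (an assumed, simplified recipe).

    Steps: covariance of sites -> ridge-regularized inverse -> zero the
    diagonal -> eigenvector with the largest-magnitude eigenvalue.
    """
    cov = np.cov(msa, rowvar=False)
    inv = np.linalg.inv(cov + ridge * np.eye(cov.shape[0]))
    inv = 0.5 * (inv + inv.T)        # enforce exact symmetry before eigh
    np.fill_diagonal(inv, 0.0)       # keep only off-diagonal couplings
    eigvals, eigvecs = np.linalg.eigh(inv)
    top = np.argmax(np.abs(eigvals))
    return np.abs(eigvecs[:, top])   # magnitude of each site's component


# Toy alignment (hypothetical): sites 0 and 1 are coupled, site 5 is conserved.
rng = np.random.default_rng(0)
n_seq = 500
msa = rng.integers(0, 2, size=(n_seq, 6)).astype(float)
flip = rng.random(n_seq) < 0.1
msa[:, 1] = np.where(flip, 1.0 - msa[:, 0], msa[:, 0])
msa[:, 5] = (rng.random(n_seq) < 0.95).astype(float)

conservation = site_conservation(msa)
icod_scores = icod_like_scores(msa)
print("conservation per site:", np.round(conservation, 3))
print("ICOD-like score per site:", np.round(icod_scores, 3))
```

In this toy setting the two scores highlight different sites: the coupling-based score picks out the correlated pair, while conservation picks out the biased column. That is one simple way to see why the two analyses can carry complementary information. The toy sequences are sampled independently, so the example does not include any phylogenetic correlations.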
Abstract
Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated with functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e. from phylogeny. Here, we generate controlled synthetic data from a minimal…
Taxonomy
Topics: Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks · Gene expression and cancer classification
