Identifying relevant positions in proteins by Critical Variable Selection
Silvia Grigolon, Silvio Franz, Matteo Marsili

TL;DR
This paper introduces Critical Variable Selection, a new method to identify key sites in proteins from sequence data, capturing complex dependencies beyond pairwise correlations and revealing biologically relevant structural and functional sites.
Contribution
The paper presents a novel method for extracting relevant protein sites from sequence alignments that captures higher-order dependencies and complements existing analysis techniques.
Findings
Recovers information beyond pairwise correlations
Works effectively with small datasets of a few hundred sequences
Identifies biologically relevant sites consistent with known data
Abstract
Evolution in its course found a variety of solutions to the same optimisation problem. The advent of high-throughput genomic sequencing has made available extensive data from which, in principle, one can infer the underlying structure on which biological functions rely. In this paper, we present a new method aimed at extracting sites encoding structural and functional properties from a set of protein primary sequences, namely a Multiple Sequence Alignment. The method, called Critical Variable Selection, is based on the idea that subsets of relevant sites correspond to subsequences that occur with a particularly broad frequency distribution in the dataset. By applying this algorithm to in silico sequences, to the Response Regulator Receiver and to the Voltage Sensor Domain of Ion Channels, we show that this procedure recovers not only information encoded in single site statistics and…
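The idea of a "broad frequency distribution" can be made concrete with a small sketch. The abstract does not give the exact score, so the code below assumes one plausible measure: the entropy of the distribution of observed frequencies, where m_k is the number of distinct subsequences seen exactly k times among M sequences. The function names `subsequence_counts` and `frequency_entropy` and the toy alignment are hypothetical, chosen for illustration only.

```python
from collections import Counter
from math import log


def subsequence_counts(alignment, sites):
    """Count how often each subsequence (the alignment restricted to `sites`) occurs."""
    return Counter(tuple(seq[i] for i in sites) for seq in alignment)


def frequency_entropy(alignment, sites):
    """Entropy of the distribution of observed frequencies (an assumed breadth measure).

    m[k] is the number of distinct subsequences seen exactly k times. A sequence
    drawn at random from the alignment carries a subsequence seen k times with
    probability k * m[k] / M, and the entropy of that distribution is returned.
    """
    M = len(alignment)
    counts = subsequence_counts(alignment, sites)
    m = Counter(counts.values())
    return -sum((k * mk / M) * log(k * mk / M) for k, mk in m.items())


# Toy alignment (made up): site 0 is fully conserved, site 2 is different in
# every sequence, and sites (0, 1) together give subsequences seen 3, 2 and 1 times.
toy_alignment = ["AAC", "AAD", "AAE", "AGF", "AGH", "AKI"]

conserved = frequency_entropy(toy_alignment, [0])
all_distinct = frequency_entropy(toy_alignment, [2])
mixed = frequency_entropy(toy_alignment, [0, 1])
```

Under this assumed measure, both a fully conserved site and a site that differs in every sequence score zero, because all subsequences then share a single frequency. Only subsets whose subsequences appear at many different frequencies score high, which is the sense of "broad" used here. A search over subsets of sites would be needed on top of this score; the abstract does not describe that step, so it is left out.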
