Statistical physics of interacting proteins: impact of dataset size and quality assessed in synthetic sequences
Carlos A. Gandarilla-Pérez, Pierre Mergny, Martin Weigt, Anne-Florence Bitbol

TL;DR
This study uses synthetic data to analyze how dataset size and quality affect the performance of inverse statistical physics methods, like DCA, in predicting protein-protein interactions and residue contacts.
Contribution
It formalizes the relationship between dataset size, quality, and inference accuracy using Ising models and synthetic sequences, providing insights into optimal data utilization.
Findings
DCA performs well with large datasets.
Iterative pairing can predict without training data.
Performance scales quadratically with dataset quality and size.
Abstract
Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g. Direct Coupling Analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins, and identifying pairs of residues that form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and inter-block couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to…
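The data-generation setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the contact probabilities `p_intra` and `p_inter`, the single coupling strength `j`, and the choice of single-spin-flip Metropolis sampling at unit temperature are all assumptions made for illustration. Each block of spins stands for one protein, and couplings between blocks are drawn only for block pairs listed as interacting.

```python
import numpy as np

def block_couplings(n_blocks, block_size, interacting_pairs, p_intra, p_inter, j, rng):
    """Draw a symmetric Ising coupling matrix on a stochastic block model.

    Sites in the same block are coupled with probability p_intra; sites in
    two blocks listed in interacting_pairs (tuples (a, b) with a < b) are
    coupled with probability p_inter; all other pairs are uncoupled.
    All parameter names here are hypothetical.
    """
    n = n_blocks * block_size
    block = np.arange(n) // block_size  # block index of every site
    J = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            ba, bb = int(block[a]), int(block[b])
            if ba == bb:
                p = p_intra
            elif (ba, bb) in interacting_pairs:
                p = p_inter
            else:
                p = 0.0
            if rng.random() < p:
                J[a, b] = J[b, a] = j
    return J

def metropolis_sample(J, n_samples, n_sweeps, rng):
    """Sample spin configurations from P(s) ~ exp(sum_{i<j} J_ij s_i s_j).

    Each sample starts from an independent random configuration and is
    equilibrated for n_sweeps sweeps of single-spin-flip Metropolis moves
    (temperature is absorbed into J).
    """
    n = J.shape[0]
    samples = np.empty((n_samples, n), dtype=int)
    for s in range(n_samples):
        spins = rng.choice([-1, 1], size=n)
        for _ in range(n_sweeps * n):
            i = rng.integers(n)
            # Energy change of flipping spin i, with H = -sum_{i<j} J_ij s_i s_j
            delta_e = 2.0 * spins[i] * (J[i] @ spins)
            if delta_e <= 0 or rng.random() < np.exp(-delta_e):
                spins[i] = -spins[i]
        samples[s] = spins
    return samples

# Small demo: three "proteins" of five sites each; only blocks 0 and 1 interact.
rng = np.random.default_rng(0)
J_demo = block_couplings(3, 5, {(0, 1)}, p_intra=0.4, p_inter=0.2, j=0.5, rng=rng)
synthetic = metropolis_sample(J_demo, n_samples=20, n_sweeps=5, rng=rng)
print(synthetic.shape)
```

In this sketch, the number of samples plays the role of dataset size. The rows of `synthetic` are analogous to a multiple sequence alignment with binary "residues", which an inference method such as DCA would then take as input.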
