Data-Error Scaling Laws in Machine Learning on Combinatorial Mutation-prone Sets: Proteins and Small Molecules
Vanni Doffini, O. Anatole von Lilienfeld, Michael A. Nash

TL;DR
This study examines data-error scaling laws of machine learning models trained on mutation-prone combinatorial spaces such as proteins and small molecules, revealing discontinuous phase transitions in the learning curves and proposing new normalization strategies.
Contribution
It uncovers discontinuous phase transitions in data-error scaling laws and introduces mutant-based shuffling for better normalization of learning curves.
Findings
Discontinuous phase transitions in test error during learning.
Two distinct learning regimes: saturated and asymptotic decay.
Mutant-based shuffling improves normalization of learning curves.
Abstract
We investigate trends in the data-error scaling laws of machine learning (ML) models trained on discrete combinatorial spaces that are prone-to-mutation, such as proteins or organic small molecules. We trained and evaluated kernel ridge regression machines using variable amounts of computational and experimental training data. Our synthetic datasets comprised i) two naïve functions based on many-body theory; ii) binding energy estimates between a protein and a mutagenised peptide; and iii) solvation energies of two 6-heavy atom structural graphs, while the experimental dataset consisted of a full deep mutational scan of the binding protein GB1. In contrast to typical data-error scaling laws, our results showed discontinuous monotonic phase transitions during learning, observed as rapid drops in the test error at particular thresholds of training data. We observed two learning regimes,…
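To make the idea of a data-error learning curve concrete, the following is a minimal sketch of kernel ridge regression evaluated at increasing training-set sizes on a toy combinatorial space. Everything specific here is an assumption for illustration and not taken from the paper: the 4-letter, 5-site sequence space, the random one-body plus two-body target function, the Gaussian kernel on one-hot encodings, the hyperparameters `sigma` and `lam`, and the helper names.

```python
import itertools
import numpy as np


def one_hot(seqs, n_letters):
    """Encode integer sequences of shape (n, L) as flat one-hot vectors."""
    seqs = np.asarray(seqs)
    return np.eye(n_letters)[seqs].reshape(len(seqs), -1)


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel between the rows of a and the rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def krr_fit_predict(x_train, y_train, x_test, sigma=2.0, lam=1e-6):
    """Fit kernel ridge regression and predict on x_test (assumed hyperparameters)."""
    k = gaussian_kernel(x_train, x_train, sigma)
    alpha = np.linalg.solve(k + lam * np.eye(len(k)), y_train)
    return gaussian_kernel(x_test, x_train, sigma) @ alpha


def learning_curve(x, y, train_sizes, n_test, seed=0):
    """Return the test mean absolute error for each training-set size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    test, pool = order[:n_test], order[n_test:]
    errors = []
    for n in train_sizes:
        train = pool[:n]  # nested training sets drawn from one random shuffle
        pred = krr_fit_predict(x[train], y[train], x[test])
        errors.append(float(np.mean(np.abs(pred - y[test]))))
    return errors


# Toy space (assumption): all 4**5 = 1024 sequences over a 4-letter alphabet.
n_letters, length = 4, 5
seqs = np.array(list(itertools.product(range(n_letters), repeat=length)))

# Toy target (assumption): random one-body terms plus random two-body terms.
rng = np.random.default_rng(1)
site = rng.normal(size=(length, n_letters))
pair = rng.normal(size=(length, length, n_letters, n_letters))
y = site[np.arange(length), seqs].sum(axis=1)
for i in range(length):
    for j in range(i + 1, length):
        y = y + 0.5 * pair[i, j, seqs[:, i], seqs[:, j]]

x = one_hot(seqs, n_letters)
train_sizes = [16, 64, 256, 768]
errors = learning_curve(x, y, train_sizes, n_test=200)
for n, e in zip(train_sizes, errors):
    print(f"N = {n:4d}  test MAE = {e:.4f}")
```

Plotting `errors` against `train_sizes` on log-log axes gives the learning curve. A smooth toy function like this one is not expected to reproduce the phase transitions reported in the abstract; the sketch only shows how such a curve is measured.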
Taxonomy
Topics: Machine Learning in Bioinformatics · Computational Drug Discovery Methods · Advanced Proteomics Techniques and Applications
