Greater than the Sum of Its Parts: Building Substructure into Protein Encoding Models
Robert Calef, Arthur Liang, Manolis Kellis, Marinka Zitnik

TL;DR
This paper introduces Magneton, an environment comprising a dataset, benchmarks, and a fine-tuning method for incorporating substructure information into protein models, with the aim of improving function prediction and the consistency of learned representations.
Contribution
The paper presents Magneton, a framework and dataset for integrating substructure annotations into protein models so that they better capture functional and structural detail.
Findings
Substructure-tuning improves function prediction accuracy.
Models become more consistent in representing unseen substructure types.
Substructural supervision complements global structure inputs.
Abstract
Protein representation learning has advanced rapidly with the scale-up of sequence and structure supervision, but most models still encode proteins either as per-residue token sequences or as single global embeddings. This overlooks a defining property of protein organization: proteins are built from recurrent, evolutionarily conserved substructures that concentrate biochemical activity and mediate core molecular functions. Although substructures such as domains and functional sites are systematically cataloged, they are rarely used as training signals or representation units in protein models. We introduce Magneton, an environment for developing substructure-aware protein models. Magneton provides (1) a dataset of 530,601 proteins annotated with over 1.7 million substructures spanning 13,075 types, (2) a training framework for incorporating substructures into existing protein models,…
Taxonomy
Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Bioinformatics and Genomic Networks
