GRIMM: Genetic stRatification for Inference in Molecular Modeling
Ashley Babjac, Adrienne Hoarfrost

TL;DR
GRIMM introduces a clustering-based benchmark for enzyme function prediction that evaluates how well models generalize to both known and novel enzyme classes, addressing dataset bias and out-of-distribution generalization.
Contribution
It formalizes a genetic stratification strategy for creating realistic training, validation, and test splits, including open-set scenarios, applicable to sequence-based classification tasks.
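The core idea of assigning whole clusters to a single split can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes cluster IDs have already been computed by some sequence clustering tool, and the function name `cluster_split`, the 80/10/10 fractions, and the greedy assignment rule are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def cluster_split(cluster_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test (hypothetical helper).

    cluster_ids[i] is the cluster label of sequence i. Every sequence in
    a cluster lands in the same split, so no cluster spans two splits.
    """
    # Group sequence indices by cluster.
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    # Shuffle clusters reproducibly.
    order = sorted(clusters)
    random.Random(seed).shuffle(order)

    names = ("train", "val", "test")
    targets = [f * len(cluster_ids) for f in fractions]
    splits = {name: [] for name in names}

    for cid in order:
        # Greedy rule (an assumption): place the cluster in the split
        # that is currently furthest below its target size.
        deficits = [targets[i] - len(splits[names[i]]) for i in range(3)]
        best = max(range(3), key=lambda i: deficits[i])
        splits[names[best]].extend(clusters[cid])
    return splits


if __name__ == "__main__":
    ids = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5", "c5", "c6"]
    for name, members in cluster_split(ids, seed=1).items():
        print(name, sorted(members))
```

Because clusters vary in size, the resulting split sizes only approximate the requested fractions; exact proportions would require splitting a cluster, which defeats the purpose of the stratification.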
Findings
Provides a reproducible framework for sequence clustering and data splitting.
Enables evaluation of models on both in-distribution and out-of-distribution enzyme functions.
Applicable to various biological sequence classification problems.
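To make the in-distribution versus out-of-distribution distinction concrete, the sketch below partitions test examples by whether their label occurs in the training set. It is an illustration under stated assumptions rather than the benchmark's actual procedure: the function name `partition_open_set` is hypothetical, and the EC-number strings are used only as example labels.

```python
def partition_open_set(train_labels, test_labels):
    """Split test indices into known-class and novel-class groups.

    A test example is treated as in-distribution if its label appears
    anywhere in the training labels, and out-of-distribution otherwise.
    """
    known = set(train_labels)
    in_dist = [i for i, y in enumerate(test_labels) if y in known]
    out_dist = [i for i, y in enumerate(test_labels) if y not in known]
    return in_dist, out_dist


if __name__ == "__main__":
    train = ["1.1.1.1", "2.7.11.1", "1.1.1.1"]
    test = ["1.1.1.1", "3.4.21.4", "2.7.11.1", "3.4.21.4"]
    in_dist, out_dist = partition_open_set(train, test)
    print("known-class test indices:", in_dist)
    print("novel-class test indices:", out_dist)
```

Reporting metrics separately on the two groups keeps strong performance on familiar classes from masking failures on unseen ones.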
Abstract
The vast majority of biological sequences encode unknown functions and bear little resemblance to experimentally characterized proteins, limiting both our understanding of biology and our ability to harness functional potential for the bioeconomy. Predicting enzyme function from sequence remains a central challenge in computational biology, complicated by low sequence diversity and imbalanced label support in publicly available datasets. Models trained on these data can overestimate performance and fail to generalize. To address this, we introduce GRIMM (Genetic stRatification for Inference in Molecular Modeling), a benchmark for enzyme function prediction that employs genetic stratification: sequences are clustered by similarity and clusters are assigned exclusively to training, validation, or test sets. This ensures that sequences from the same cluster do not appear in multiple…
Taxonomy
Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Bioinformatics and Genomic Networks
