Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA
Mingyu Huang, Shasha Zhou, Ke Li

TL;DR
This paper introduces GraphFLA, a Python framework that enhances biological fitness prediction benchmarks by adding landscape topographical features, enabling better interpretation and comparison of model performance across diverse biological datasets.
Contribution
GraphFLA provides a novel method to analyze and interpret fitness landscapes with biologically relevant features, improving benchmarking of mutational effect prediction models.
Findings
GraphFLA successfully characterizes landscape topography across diverse datasets.
Application of GraphFLA reveals factors influencing model accuracy.
The authors release an extensive collection of empirical fitness landscapes for future research.
Abstract
Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from mutagenesis data in diverse modalities (e.g., DNA, RNA, protein, and beyond) with up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance…
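To make the idea of "landscape topography" concrete, the following is a minimal sketch in plain Python and does not use GraphFLA's actual API. It assumes a fitness landscape can be represented as a graph whose nodes are sequence variants and whose edges join variants that differ by a single substitution, and it counts local optima as one simple indicator of ruggedness. The function names (`build_landscape`, `local_optima`) and the toy fitness values are hypothetical and chosen only for illustration; whether local-optima counts are among the 20 features GraphFLA reports is not stated in the abstract.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def build_landscape(fitness):
    """Build an adjacency list linking variants one substitution apart.

    `fitness` maps each sequence to a measured fitness value.
    This is an O(n^2) pairwise comparison, fine for a toy example only.
    """
    neighbors = {seq: [] for seq in fitness}
    for a, b in combinations(fitness, 2):
        if len(a) == len(b) and hamming_distance(a, b) == 1:
            neighbors[a].append(b)
            neighbors[b].append(a)
    return neighbors


def local_optima(fitness, neighbors):
    """Return variants that are strictly fitter than all of their neighbors."""
    return sorted(
        seq
        for seq, nbrs in neighbors.items()
        if all(fitness[seq] > fitness[n] for n in nbrs)
    )


# Hypothetical two-site, two-letter landscape with two separated peaks.
toy_fitness = {"AA": 1.0, "AB": 0.4, "BA": 0.5, "BB": 0.9}
toy_neighbors = build_landscape(toy_fitness)
toy_optima = local_optima(toy_fitness, toy_neighbors)
print(toy_optima)
```

In this toy landscape both "AA" and "BB" are local optima because every single-mutation path between them passes through a less fit intermediate. A landscape with many such peaks is more rugged, which is the kind of property that an averaged benchmark score alone does not reveal.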
