VenusX: Unlocking Fine-Grained Functional Understanding of Proteins
Yang Tan, Wenrui Gou, Bozitao Zhong, Liang Hong, Huiqun Yu, Bingxin Zhou

TL;DR
VenusX is a large-scale benchmark for fine-grained protein function annotation at the residue, fragment, and domain levels. It is designed to evaluate how well models capture protein functional mechanisms and how well they generalize out of distribution.
Contribution
This work presents the first large-scale benchmark for fine-grained protein functional annotation, covering residue, fragment, and domain levels with diverse tasks and datasets.
Findings
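Baseline models show varied performance across tasks; per the reviews, epitope prediction is notably hard, with AUPR below 0.3 even for top-performing models.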
The benchmark supports assessment of both in-distribution and out-of-distribution generalization through mixed-family and cross-family data splits.
Publicly available code and data facilitate future research.
Abstract
Deep learning models have driven significant progress in predicting protein function and interactions at the protein level. While these advancements have been invaluable for many biological applications such as enzyme engineering and function annotation, a more detailed perspective is essential for understanding protein functional mechanisms and evaluating the biological knowledge captured by models. To address this demand, we introduce VenusX, the first large-scale benchmark for fine-grained functional annotation and function-based protein pairing at the residue, fragment, and domain levels. VenusX comprises three major task categories across six types of annotations, including residue-level binary classification, fragment-level multi-class classification, and pairwise functional similarity scoring for identifying critical active sites, binding sites, conserved sites, motifs, domains,…
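As an illustration of the pairwise functional similarity task named in the abstract, the following is a minimal sketch that scores two fragments by the cosine similarity of their mean-pooled residue embeddings. The pooling choice, the cosine score, and the function names `mean_pool`, `cosine_similarity`, and `fragment_similarity` are assumptions made for illustration; the abstract does not specify how VenusX computes similarity.

```python
import math


def mean_pool(residue_embeddings):
    """Average per-residue vectors into a single fragment-level vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fragment_similarity(frag_a, frag_b):
    """Hypothetical scoring rule: cosine similarity of mean-pooled embeddings."""
    return cosine_similarity(mean_pool(frag_a), mean_pool(frag_b))


# Toy 2-dimensional embeddings (made up); a real encoder would emit far wider vectors.
frag_a = [[1.0, 0.0], [1.0, 0.2]]
frag_b = [[0.8, 0.3], [1.0, 0.1]]  # points in roughly the same direction as frag_a
frag_c = [[0.0, 1.0], [0.1, 1.0]]  # points in a different direction

print(fragment_similarity(frag_a, frag_b))
print(fragment_similarity(frag_a, frag_c))
```

Under this scoring rule, a pair is predicted to share a function when its score exceeds a threshold chosen on validation data.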
Peer Reviews
Decision: ICLR 2026 Poster
1. VENUSX introduces a new focus on fine-grained protein functional understanding through novel residue-, fragment-, and similarity-based tasks, with cross-family splits to test generalization, addressing a key gap in evaluating localized biological signals.
2. The benchmark is well-constructed from over 878k curated samples, with detailed task definitions, appropriate metrics, diverse baselines, and clear documentation, making it practical and accessible for research.
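The cross-family splits mentioned in this review can be sketched as a group-wise split in which every family lands entirely in either the training or the test set. This is a minimal sketch under assumptions: the function name `cross_family_split`, the 20% test fraction, and the placeholder family labels are hypothetical and are not taken from the VenusX code.

```python
import random


def cross_family_split(samples, test_fraction=0.2, seed=0):
    """Split (sample_id, family_id) pairs so that no family spans both sets."""
    families = sorted({family for _, family in samples})
    rng = random.Random(seed)
    rng.shuffle(families)
    # Hold out at least one whole family for testing.
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])
    train = [s for s in samples if s[1] not in test_families]
    test = [s for s in samples if s[1] in test_families]
    return train, test


# Placeholder sample and family identifiers (hypothetical, not real annotations).
samples = [
    ("p1", "FAM_A"), ("p2", "FAM_A"), ("p3", "FAM_B"), ("p4", "FAM_B"),
    ("p5", "FAM_C"), ("p6", "FAM_D"), ("p7", "FAM_E"), ("p8", "FAM_E"),
]
train, test = cross_family_split(samples)
print(sorted({fam for _, fam in train}), sorted({fam for _, fam in test}))
```

A mixed-family split would instead shuffle individual samples, so test proteins can share a family with training proteins; comparing the two settings gives a rough measure of the generalization gap.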
1. The evaluation lacks newer protein language models such as ESM3, which models sequence, structure, and function jointly; its absence limits the benchmark’s ability to reflect current state-of-the-art performance on fine-grained functional tasks.
2. Residue- and fragment-level tasks use only frozen residue embeddings passed through two linear layers; the lack of experiments on fine-tuning language models leaves unaddressed whether supervised adaptation can improve capture of localized functio
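The probing setup this review describes, frozen residue embeddings passed through two linear layers, can be sketched as the forward pass below for the residue-level binary task. This is a sketch under stated assumptions, not the authors' implementation: the ReLU between the layers, the sigmoid output, the layer sizes, and the random vectors standing in for encoder output are all illustrative choices.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(in_dim, out_dim, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def probe_forward(residue_embeddings, layer1, layer2):
    """Return one probability per residue that it is a functional site."""
    probs = []
    for emb in residue_embeddings:
        hidden = [max(0.0, h) for h in linear(emb, *layer1)]  # ReLU (assumed)
        logit = linear(hidden, *layer2)[0]
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid (assumed)
    return probs


rng = random.Random(0)
embed_dim, hidden_dim = 8, 4  # toy sizes, not the real encoder width
layer1 = init_layer(embed_dim, hidden_dim, rng)
layer2 = init_layer(hidden_dim, 1, rng)

# Random vectors stand in for the frozen encoder output of a 5-residue protein.
embeddings = [[rng.gauss(0, 1) for _ in range(embed_dim)] for _ in range(5)]
print(probe_forward(embeddings, layer1, layer2))
```

In a frozen-embedding setup only `layer1` and `layer2` would receive gradient updates, whereas fine-tuning would also update the encoder that produces the embeddings.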
This problem is important for analyzing proteins, both for making predictions for the wide variety of uncharacterized natural proteins and for designing new proteins. The benchmark appears to be well organized and accessible for future researchers.
I have key questions about how fragments are defined for certain annotation types. See below. There are no non-deep-learning baselines for some tasks. Both the fragment-level classification and fragment similarity tasks don't seem to be simulating a workflow that practitioners would encounter in practice. How practical is it to be presented with a pre-identified sub-sequence of the protein without knowledge of its functional label? Usually, this is presented as a sequence segmentation problem
- VenusX represents the first large-scale benchmark specifically targeting fine-grained, sub-protein functional understanding, addressing a critical gap in existing protein benchmarks that largely focus on protein-level properties.
- The benchmark encompasses a diverse set of tasks (residue-level, fragment-level, and pairwise functional similarity), multiple annotation types (active sites, binding sites, epitopes, etc.), and meticulously designed data splits (mixed-family, cross-family, and mult
- Although a wide range of models is evaluated, the paper provides limited insight into why certain approaches, such as sequence-structure hybrids, perform better on cross-family splits. Incorporating ablation studies or attention visualization could shed light on the learned representations and their functional relevance.
- Performance on the epitope prediction (Epi) task is notably low (AUPR < 0.3), even for top-performing models. A more detailed discussion of the intrinsic challenges of this ta
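For readers interpreting the AUPR figure quoted in this review, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. The function name `average_precision` and the toy labels are illustrative, the simple ranking does not treat tied scores specially, and the paper may compute AUPR differently.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank residues from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this cutoff
    return total / n_pos


# Toy example: positives land at ranks 1 and 3.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(labels, scores))  # (1/1 + 2/3) / 2
```

An uninformative scorer has an expected AUPR close to the fraction of positive residues, so a low absolute value is best read against the class balance of the task. That is a general property of the metric and not a statement about the Epi dataset.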
Taxonomy
Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Vaccines and immunoinformatics approaches
Methods: Sparse Evolutionary Training
