LiveProteinBench: A Contamination-Free Benchmark for Assessing Models' Specialized Capabilities in Protein Science
Dingyi Rong, Zijian Chen, Qi Jia, Kaiwei Zhang, Haotian Lu, Guangtao Zhai, Ning Liu

TL;DR
LiveProteinBench introduces a contamination-free, multimodal benchmark with 12 protein tasks, revealing current LLMs' strengths and limitations in specialized protein reasoning and multimodal data fusion.
Contribution
It presents a novel, contamination-free benchmark for evaluating LLMs on protein tasks, emphasizing multimodal assessment and recent protein data validation.
Findings
General-purpose LLMs outperform domain-specific models by over 20% accuracy.
Multimodal structural information often does not improve performance and can degrade it.
Model performance correlates more with inference cost than parameter count.
Abstract
In contrast to their remarkable performance on general knowledge QA, the true abilities of Large Language Models (LLMs) in tasks demanding deep, specialized reasoning, such as in protein biology, have yet to be thoroughly investigated. Current benchmarks suffer from critical deficiencies, such as data contamination due to outdated test sets, insufficient focus on essential protein-specific tasks, and a neglect of multimodal assessments. To resolve these issues, we introduce LiveProteinBench, a contamination-free, multimodal benchmark of 12 tasks for evaluating LLM performance on protein property and function prediction. Its central innovation lies in a test set composed exclusively of proteins validated after the start of 2025, guaranteeing that the data is novel to all tested models. We benchmarked a suite of prominent general-purpose LLMs and specialized biological LLMs using both…
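The temporal filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `ProteinEntry` record, its `validated_on` field, the `select_live_entries` helper, and the choice to treat 1 January 2025 as an inclusive cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProteinEntry:
    """Hypothetical record; field names are illustrative, not from the paper."""
    accession: str
    validated_on: date  # date the annotation was experimentally validated


# Assumed cutoff: "after the start of 2025" is read here as on or after 2025-01-01.
CUTOFF = date(2025, 1, 1)


def select_live_entries(entries, cutoff=CUTOFF):
    """Keep only entries validated on or after the cutoff date."""
    return [e for e in entries if e.validated_on >= cutoff]


if __name__ == "__main__":
    sample = [
        ProteinEntry("EXAMPLE_A", date(2024, 11, 3)),
        ProteinEntry("EXAMPLE_B", date(2025, 2, 17)),
    ]
    for entry in select_live_entries(sample):
        print(entry.accession, entry.validated_on.isoformat())
```

Filtering on a validation date rather than a submission date is one plausible reading of "validated after the start of 2025"; a real pipeline would need to decide which database timestamp represents that event.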
Peer Reviews
Decision · Submitted to ICLR 2026
- LiveProteinBench addresses major flaws in existing protein evaluation (contamination, outdated tasks, lack of multimodality) with rigorous dataset construction and a “live data” principle.
- The benchmark offers 12 well-structured tasks grounded in validated annotations; task variety enables broad assessment of biological reasoning.
- The evaluations are zero-shot. It would be valuable to see whether task-tuned or instruction-fine-tuned models can close the generalist-specialist gap.
The paper is original in proposing a live, contamination-free design for benchmarking LLMs in biology. The methodology is rigorous, with carefully defined tasks, fair temporal splits, and reproducibility ensured through public databases. The clarity of the presentation and experimental analyses is high, and the results are significant
The multimodal evaluation relies on 2D structure projections, which may not fully capture 3D relationships; alternative encodings could be discussed. The benchmark focuses only on single-protein properties, omitting interactions or dynamics that are crucial in biological contexts. Evaluation metrics are limited to accuracy. Discussion of related work such as [1, 2, 3] is limited.
[1] STELLA: Towards Protein Function Prediction with Multimodal LLMs Integrating Sequence-Structure Representatio
1- The use of post-2025 protein entries ensures that none of the test data overlaps with pretraining corpora, addressing a critical issue in LLM evaluation: data leakage.
2- Broad evaluation across general and domain-specific LLMs.
1- Lack of methodological innovation and dataset accessibility. The manuscript does not present any clear methodological innovation beyond the temporal filtering strategy. Furthermore, the authors do not provide access to the benchmark dataset.
2- The authors did not provide details on the proteins analyzed. They mention only the selection criteria from public databases, without any further biological information (e.g., biological diversity, sequence novelty, representativeness of real challenges, etc.). There
Taxonomy
Topics: Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks · Biomedical Text Mining and Ontologies
