Rethinking Text-based Protein Understanding: Retrieval or LLM?

Juntong Wu; Zijing Liu; He Cao; Hao Li; Bin Feng; Zishan Shu; Ke Yu; Li Yuan; Yu Li

arXiv:2505.20354 · cs.CL · November 11, 2025

TL;DR

This paper analyzes current text-based protein understanding models and benchmarks, identifies data leakage and ill-suited evaluation metrics, and proposes a retrieval-enhanced method that outperforms fine-tuned LLMs on protein-to-text tasks.

Contribution

It introduces a new evaluation framework based on biological entities and a retrieval-based approach that improves performance and efficiency in protein-text understanding.

Findings

01 Retrieval-enhanced method outperforms fine-tuned LLMs

02 Existing benchmarks suffer from data leakage issues

03 New evaluation framework improves assessment accuracy

Abstract

In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model's performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which…
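As a rough illustration of why an entity-based score can differ from surface-overlap metrics borrowed from natural language processing, the sketch below compares a predicted description with a reference by the set of biological terms each one mentions. It is not the paper's metric: the vocabulary-matching extractor, the function names `extract_entities` and `entity_f1`, and the example vocabulary are all assumptions made for illustration.

```python
import re


def extract_entities(text, vocabulary):
    """Return the vocabulary terms found in text (case-insensitive, whole-term match).

    Hypothetical stand-in for a real biological entity extractor.
    """
    lowered = text.lower()
    found = set()
    for term in vocabulary:
        # Match the term only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(term.lower())
    return found


def entity_f1(prediction, reference, vocabulary):
    """Compute (precision, recall, F1) over the entity sets of two texts."""
    pred = extract_entities(prediction, vocabulary)
    ref = extract_entities(reference, vocabulary)
    if not pred and not ref:
        # Neither text mentions any known entity: treat as trivially matching.
        return 1.0, 1.0, 1.0
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative vocabulary; a real setup would use a curated ontology.
    vocab = ["ATP binding", "kinase activity", "nucleus", "membrane"]
    predicted = "This protein has kinase activity and is located in the membrane."
    reference = "The protein shows kinase activity and ATP binding in the nucleus."
    print(entity_f1(predicted, reference, vocab))
```

In this example the two sentences share most of their wording but only one of the listed entities, so the entity-level score stays low even where a word-overlap metric would look favorable.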

Code & Models

Repositories

IDEA-XL/RAPM (PyTorch, official)

Datasets

TimeRune/Prot-Inst-OOD (dataset, 17 downloads)

Videos

Rethinking Text-based Protein Understanding: Retrieval or LLM?· underline

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Machine Learning in Bioinformatics

Methods: Softmax · Attention Is All You Need · Focus