Unlocking Biological Workflows for Robust Protein-Text Question Answering: A Dual-Dimensional RAG Framework

Li Ding; Duanyu Feng; Chen Huang; Yangshuai Wang; Yang Li; Wenqiang Lei; See-Kiong Ng

arXiv:2605.17261·cs.IR·May 19, 2026

TL;DR

This paper introduces 2D-ProteinRAG, a novel framework that enhances protein-text question answering by integrating biological workflows with dual-dimensional filtering, improving robustness and generalization to out-of-distribution proteins.

Contribution

The paper presents 2D-ProteinRAG, a new framework that enables LLMs to operate within biological research workflows (BLAST) and employs a dual-dimensional filtering strategy to extract high-quality information from noisy retrieval contexts.

Findings

01 Achieves state-of-the-art performance on biological OOD benchmarks.

02 Outperforms fine-tuned baselines and other RAG methods.

03 Demonstrates robustness and scalability in real-world scenarios.

Abstract

Protein-Text Question Answering (QA) is crucial for interpreting biological sequences through natural language. Combining Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), which efficiently leverages biological databases and facilitates reasoning, is a promising approach to this task. However, constrained by the standard RAG pipeline, existing models often rely on curated, static datasets instead of expert-proven biological workflows; as a result, they lack fine-grained information processing and struggle to generalize to novel, out-of-distribution (OOD) proteins. To bridge this gap, we propose 2D-ProteinRAG, a novel framework that empowers LLMs to operate within the gold-standard biological research workflow (BLAST). To further extract high-quality information from noisy retrieval contexts, we introduce a dual-dimensional (2D) filtering strategy that follows expert analytical paradigms.…
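
To make the idea of using BLAST output as retrieval context more concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract is truncated and does not specify what the two filtering dimensions are, so the e-value and percent-identity thresholds below are generic stand-ins. The function names (`parse_hits`, `select_hits`, `build_context`), the threshold values, the accession IDs, and the annotation text are all hypothetical. The only external fact relied on is the default column order of BLAST+ tabular output (`-outfmt 6`).

```python
from dataclasses import dataclass

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


@dataclass
class Hit:
    sseqid: str      # subject (database) sequence ID
    pident: float    # percent identity of the alignment
    evalue: float    # expectation value (lower is more significant)
    bitscore: float  # normalized alignment score


def parse_hits(tabular: str) -> list[Hit]:
    """Parse tab-separated BLAST rows, skipping blank and comment lines."""
    hits = []
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        hits.append(Hit(row["sseqid"], float(row["pident"]),
                        float(row["evalue"]), float(row["bitscore"])))
    return hits


def select_hits(hits, max_evalue=1e-5, min_pident=30.0, top_k=3):
    """Keep significant, sufficiently similar hits; rank by bit score.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    kept = [h for h in hits if h.evalue <= max_evalue and h.pident >= min_pident]
    return sorted(kept, key=lambda h: h.bitscore, reverse=True)[:top_k]


def build_context(hits, annotations):
    """Format the selected hits as plain-text context for an LLM prompt."""
    lines = []
    for h in hits:
        note = annotations.get(h.sseqid, "no annotation available")
        lines.append(f"{h.sseqid} (identity {h.pident:.1f}%): {note}")
    return "\n".join(lines)


# Made-up rows in -outfmt 6 layout; the IDs and numbers are not real data.
EXAMPLE = "\n".join([
    "query1\tP12345\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-120\t410.0",
    "query1\tQ67890\t28.0\t180\t120\t5\t3\t182\t10\t190\t2e-08\t55.0",
    "query1\tA0A000\t45.0\t150\t80\t2\t5\t154\t7\t156\t0.5\t30.1",
    "query1\tB1B111\t61.0\t190\t70\t1\t2\t191\t4\t193\t3e-60\t220.0",
])

selected = select_hits(parse_hits(EXAMPLE))
context = build_context(selected, {"P12345": "hypothetical kinase annotation"})
```

In this sketch the low-identity hit and the high e-value hit are dropped before any text reaches the model, which illustrates why filtering matters when retrieval contexts are noisy. A real pipeline would obtain the tabular rows by running BLAST against a sequence database and would look up annotations from a curated source.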
