# GOAnnotator: accurate protein function annotation using automatically retrieved literature

**Authors:** Huiying Yan, Hancheng Liu, Shaojun Wang, Shanfeng Zhu

PMC · DOI: 10.1093/bioinformatics/btaf199 · 2025-07-15

## TL;DR

GOAnnotator is a framework for automated protein function annotation that retrieves relevant literature automatically instead of relying on expert-curated papers, then identifies Gene Ontology (GO) terms from the retrieved texts.

## Contribution

GOAnnotator combines two modules: PubRetriever, a hybrid system that retrieves and re-ranks relevant literature, and GORetriever+, an enhanced module that identifies GO terms from the retrieved texts.

## Key findings

- GOAnnotator outperforms GORetriever in realistic scenarios by uncovering unique literature.
- The method predicts additional protein functions not previously identified.
- Experiments on three benchmark datasets show high-quality functional annotations.

## Abstract

Automated protein function prediction/annotation (AFP) is vital for understanding biological processes and advancing biomedical research. Existing text-based AFP methods including the state-of-the-art method, GORetriever, rely on expert-curated relevant literature, which is costly and time-consuming, and cover only a small portion of the proteins in UniProt. To overcome this limitation, we propose GOAnnotator, a novel framework for automated protein function annotation. It consists of two key modules: PubRetriever, a hybrid system for retrieving and re-ranking relevant literature, and GORetriever+, an enhanced module for identifying Gene Ontology (GO) terms from the retrieved texts. Extensive experiments over three benchmark datasets demonstrate that GOAnnotator delivers high-quality functional annotations, surpassing GORetriever in realistic situations by uncovering unique literature and predicting additional functions. These results highlight its great potential to streamline and enhance annotation of protein functions without relying on manual curation.

The code and data are available at https://github.com/ZhuLab-Fudan/GOAnnotator.
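
To make the two-stage design described in the abstract concrete, here is a minimal toy sketch of a retrieve, re-rank, then annotate flow. It is not the authors' implementation: the function names (`retrieve`, `rerank`, `annotate`), the token-overlap scoring, the placeholder protein `protx`, and the tiny made-up corpus are all assumptions chosen for illustration. PubRetriever and GORetriever+ are described only at a high level in this summary, and the real code is in the repository linked above.

```python
import re
from collections import Counter

# Toy corpus and GO vocabulary. All texts are made up for illustration.
CORPUS = {
    "doc_a": "protx mediates calcium ion binding in muscle cells",
    "doc_b": "an unrelated enzyme shows protein kinase activity",
    "doc_c": "calcium ion binding by protx",
}
GO_TERMS = {
    "GO:0005509": "calcium ion binding",
    "GO:0004672": "protein kinase activity",
}

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, corpus, k=2):
    """Stage 1 (hypothetical): keep the top-k documents by shared query tokens."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(doc))), doc_id) for doc_id, doc in corpus.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

def rerank(query, doc_ids, corpus):
    """Stage 2 (hypothetical): re-order candidates by query-token density."""
    q = set(tokenize(query))

    def density(doc_id):
        tokens = tokenize(corpus[doc_id])
        return sum(t in q for t in tokens) / len(tokens)

    return sorted(doc_ids, key=lambda d: (-density(d), d))

def annotate(doc_ids, corpus, go_terms):
    """Stage 3 (hypothetical): count retrieved documents mentioning each GO term name."""
    votes = Counter()
    for doc_id in doc_ids:
        text = corpus[doc_id].lower()
        for go_id, name in go_terms.items():
            if name.lower() in text:
                votes[go_id] += 1
    return votes.most_common()

if __name__ == "__main__":
    query = "protx calcium"
    candidates = retrieve(query, CORPUS)
    ranked = rerank(query, candidates, CORPUS)
    print(ranked)
    print(annotate(ranked, CORPUS, GO_TERMS))
```

The sketch separates a cheap first-pass retrieval from a second re-ranking pass so that each stage can be swapped independently; a real system would presumably replace both toy scorers with learned models.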

## Full-text entities

- **Genes:** RYR1 (ryanodine receptor 1) [NCBI Gene 6261] {aka CCO, CMYO1A, CMYO1B, CMYP1A, CMYP1B, KDS}, FARP2 (FERM, ARH/RhoGEF and pleckstrin domain protein 2) [NCBI Gene 9855] {aka FIR, FRG, PLEKHC3}, FGR (FGR proto-oncogene, Src family tyrosine kinase) [NCBI Gene 2268] {aka SRC2, c-fgr, c-src2, p55-Fgr, p55c-fgr, p58-Fgr}, AFP (alpha fetoprotein) [NCBI Gene 174] {aka AFPD, FETA, HPAFP}
- **Diseases:** LLMs.a (MESH:D007806), MFO (MESH:C567116), PT (MESH:D006526)
- **Chemicals:** BPO (-)
- **Species:** Giardia (genus) [taxon 5740], Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** SP2024 — Homo sapiens (Human), Xeroderma pigmentosum, complementation group C, Finite cell line (CVCL_M279)

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12261426/full.md

---
Source: https://tomesphere.com/paper/PMC12261426