Mapping Biomedical Ontology Terms to IDs: Effect of Domain Prevalence on Prediction Accuracy

Thanh Son Do; Daniel B. Hier; Tayo Obafemi-Ajayi

arXiv:2409.13746 · cs.CL · May 13, 2025

Open Access

TL;DR

This study investigates how the prevalence of biomedical ontology IDs in the literature influences the accuracy of large language models in mapping terms to IDs, finding high accuracy for high-prevalence IDs and lower accuracy for low-prevalence ones.

Contribution

It demonstrates that ontology ID prevalence in biomedical literature significantly affects LLM mapping accuracy and highlights the lexicalization of high-prevalence gene symbols, informing future biomedical NLP model development.

Findings

01. High prevalence predicts better mapping accuracy for HPO, GO, and UniProtKB IDs.

02. GPT-4 achieves 95% accuracy in mapping protein names to HUGO gene symbols.

03. Mapping accuracy for HUGO gene symbols is unaffected by prevalence due to lexicalization.

Abstract

This study evaluates the ability of large language models (LLMs) to map biomedical ontology terms to their corresponding ontology IDs across the Human Phenotype Ontology (HPO), Gene Ontology (GO), and UniProtKB terminologies. Using counts of ontology IDs in the PubMed Central (PMC) dataset as a surrogate for their prevalence in the biomedical literature, we examined the relationship between ontology ID prevalence and mapping accuracy. Results indicate that ontology ID prevalence strongly predicts accurate mapping of HPO terms to HPO IDs, GO terms to GO IDs, and protein names to UniProtKB accession numbers. Higher prevalence of ontology IDs in the biomedical literature correlated with higher mapping accuracy. Predictive models based on receiver operating characteristic (ROC) curves confirmed this relationship. In contrast, this pattern did not apply to mapping protein names to Human…
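The relationship described in the abstract, where the prevalence of an ontology ID predicts whether a term is mapped correctly, can be illustrated with a small ROC computation. The following is a minimal sketch and not the authors' pipeline: the `roc_auc` helper, the variable names, and the toy counts and outcomes are all hypothetical, and the abstract does not specify how the paper's ROC-based models were built. The sketch treats the PMC mention count of each ID as a score and asks how well that score separates correct from incorrect mappings.

```python
import math


def roc_auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a randomly chosen positive example has a
    higher score than a randomly chosen negative one (ties count as half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, invented for illustration only:
# (number of PMC mentions of the ID, 1 if the LLM mapped the term correctly)
toy = [(0, 0), (1, 0), (3, 0), (5, 1), (12, 0), (40, 1), (150, 1), (900, 1)]
counts = [c for c, _ in toy]
correct = [y for _, y in toy]

# log1p compresses the heavy-tailed counts; AUC is rank-based, so this
# monotone transform does not change the result and is shown only because
# log-scaled counts are a common way to handle such data.
scores = [math.log1p(c) for c in counts]

auc = roc_auc(scores, correct)
print(f"AUC = {auc:.3f}")
```

An AUC near 0.5 would mean prevalence carries no information about mapping success, while values approaching 1.0 would match the strong dependence the abstract reports for HPO, GO, and UniProtKB IDs.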

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Absolute Position Encodings · Residual Connection