Benchmarking large language models for biomedical natural language processing applications and recommendations

Qingyu Chen; Yan Hu; Xueqing Peng; Qianqian Xie; Qiao Jin; Aidan Gilson; Maxwell B. Singer; Xuguang Ai; Po-Ting Lai; Zhizheng Wang; Vipina Kuttichi Keloth; Kalpana Raja; Jiming Huang; Huan He; Fongci Lin; Jingcheng Du; Rui Zhang; W. Jim Zheng; Ron A. Adelman; Zhiyong Lu; Hua Xu

arXiv:2305.16326·cs.CL·April 29, 2025·41 cites

TL;DR

This paper systematically evaluates large language models in biomedical NLP tasks, comparing their performance with traditional models, and provides practical insights and recommendations for their application in the biomedical domain.

Contribution

It offers a comprehensive benchmark of LLMs in BioNLP, highlighting their strengths, limitations, and the importance of fine-tuning for optimal performance.

Findings

01. Traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks.

02. GPT-4 excels in reasoning-related biomedical tasks.

03. LLMs exhibit issues such as hallucinations and missing information.

Abstract

The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs, representatives of the GPT and LLaMA families, on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open…

Code & Models

Repositories

bids-xu-lab/biomedical-nlp-benchmarks (official)

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Absolute Position Encodings · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Transformer