Benchmarking GPT-5 for biomedical natural language processing
Yu Hou, Zaifu Zhan, Min Zeng, Yifan Wu, Shuang Zhou, Rui Zhang

TL;DR
This study benchmarks GPT-5 and GPT-4o on diverse biomedical NLP tasks, demonstrating GPT-5's superior performance, efficiency, and potential for deployment in complex biomedical applications.
Contribution
The paper extends a unified benchmark to evaluate GPT-5 across multiple biomedical NLP tasks, reporting improved performance and cost-efficiency relative to GPT-4o.
Findings
GPT-5 outperforms GPT-4o on reasoning-intensive datasets
GPT-5 achieves better chemical NER and relation extraction scores
GPT-5 offers lower effective cost per correct prediction despite longer outputs
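The last finding rests on a simple ratio: total inference spend divided by the number of correct predictions. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function name, its parameters, and every number in the usage example are hypothetical placeholders; in particular, the prices are not official rates and the token counts are not results from the paper.

```python
def cost_per_correct(input_tokens, output_tokens,
                     price_in_per_m, price_out_per_m, n_correct):
    """Return total spend divided by the number of correct predictions.

    Prices are expressed per one million tokens (an assumption of this
    sketch). Returns infinity when nothing was answered correctly.
    """
    total_cost = (input_tokens / 1_000_000) * price_in_per_m \
               + (output_tokens / 1_000_000) * price_out_per_m
    if n_correct == 0:
        return float("inf")
    return total_cost / n_correct


# Hypothetical comparison: model B writes three times as many output
# tokens as model A, but has a lower input price and more correct answers.
model_a = cost_per_correct(1_000_000, 200_000, 2.50, 10.0, n_correct=50)
model_b = cost_per_correct(1_000_000, 600_000, 1.25, 10.0, n_correct=90)
print(f"model A: {model_a:.4f} per correct prediction")
print(f"model B: {model_b:.4f} per correct prediction")
```

With these placeholder values, model B spends more in total yet less per correct answer, which is the pattern the finding describes: longer outputs raise the numerator, while higher accuracy raises the denominator.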
Abstract
Biomedical literature and clinical narratives pose multifaceted challenges for natural language understanding, from precise entity extraction and document synthesis to multi-step diagnostic reasoning. This study extends a unified benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across five core biomedical NLP tasks (named entity recognition, relation extraction, multi-label document classification, summarization, and simplification) and nine expanded biomedical QA datasets covering factual knowledge, clinical reasoning, and multimodal visual understanding. Using standardized prompts, fixed decoding parameters, and consistent inference pipelines, we assessed model performance, latency, and token-normalized cost under official pricing. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets such as MedXpertQA and…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
