Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review

Yan Yang; Mouxiao Bian; Peiling Li; Bingjian Wen; Ruiyao Chen; Kangkun Mao; Xiaojun Ye; Tianbin Li; Pengcheng Chen; Bing Han; Jie Xu; Kaifeng Qiu; Junyan Wu

arXiv:2512.02024 · cs.CL · December 3, 2025

Open Access

TL;DR

This paper introduces RxBench, a comprehensive benchmark for evaluating large language models in clinical prescription review, demonstrating that some models can match or surpass human pharmacists in accuracy and robustness.

Contribution

The paper presents RxBench, a standardized, error-type-oriented benchmark for assessing LLMs in prescription review, and shows how targeted fine-tuning can improve model performance.

Findings

01

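Among 18 benchmarked LLMs, Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528 form the first tier, outperforming the other models in both accuracy and robustness.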

02

Some LLMs match or exceed licensed pharmacists in specific tasks.

03

Fine-tuning enhances model performance on short-answer questions.

Abstract

The rapid advancement of large language models (LLMs) has accelerated their integration into clinical decision support, particularly in prescription review. To enable systematic and fine-grained evaluation, we developed RxBench, a comprehensive benchmark that covers common prescription review categories and consolidates 14 frequent types of prescription errors drawn from authoritative pharmacy references. RxBench consists of 1,150 single-choice, 230 multiple-choice, and 879 short-answer items, all reviewed by experienced clinical pharmacists. We benchmarked 18 state-of-the-art LLMs and identified clear stratification of performance across tasks. Notably, Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528 consistently formed the first tier, outperforming other models in both accuracy and robustness. Comparisons with licensed pharmacists indicated that leading LLMs can match…
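
To make the benchmark's composition concrete, the item counts quoted in the abstract can be totaled and expressed as shares. This is only an illustrative sketch: the dictionary keys and the `composition` helper are hypothetical names chosen here, not part of RxBench itself, and only the three counts come from the abstract.

```python
# Item counts as stated in the abstract; the key names are illustrative.
RXBENCH_ITEMS = {
    "single_choice": 1150,
    "multiple_choice": 230,
    "short_answer": 879,
}


def composition(counts):
    """Return the total item count and each format's share of that total."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = composition(RXBENCH_ITEMS)
print(f"total items: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

By this arithmetic the three formats sum to 2,259 items, with single-choice questions making up roughly half of the benchmark and short-answer items a little under two fifths.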

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education