RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
Shuhao Chen, Weisen Jiang, Changmiao Wang, Xiaoqing Wu, Xuanren Shi, Yu Zhang, James T. Kwok

TL;DR
RxEval is a prescription-level benchmark that evaluates how well large language models recommend patient-specific medications, using multiple-choice questions built from real clinical data.
Contribution
The paper introduces RxEval, a prescription-level benchmark that pairs real prescriptions with patient-specific distractors to better assess LLMs' clinical prescribing capabilities.
Findings
LLMs show significant variability in performance on RxEval.
Even top models struggle with exact matches, indicating room for improvement.
Error analysis highlights common failure modes, such as reasoning failures and overlooking patient information.
Abstract
Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability through multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that…
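To make the question format concrete, here is a minimal Python sketch of how one such multiple-choice item and an exact-match score might be represented. Everything in it is an assumption for illustration: the class and field names (`Option`, `Question`, `correct`), the placeholder drug names, and the scoring rule (a question counts as correct only when the selected option set equals the set of real prescriptions) are hypothetical and are not taken from the RxEval release.

```python
from dataclasses import dataclass


# Hypothetical schema: names are illustrative, not RxEval's actual format.
@dataclass(frozen=True)
class Option:
    medication: str
    dose: str
    route: str


@dataclass
class Question:
    patient_profile: str          # free-text patient description
    trajectory: list[str]         # time-ordered clinical events
    options: dict[str, Option]    # label -> triple (real and distractor)
    correct: frozenset[str]       # labels of the real prescriptions


def exact_match(question: Question, selected: set[str]) -> bool:
    """True only if the selected labels equal the correct set exactly."""
    return frozenset(selected) == question.correct


def exact_match_rate(questions: list[Question],
                     predictions: list[set[str]]) -> float:
    """Fraction of questions answered with an exact match."""
    if not questions:
        return 0.0
    hits = sum(exact_match(q, p) for q, p in zip(questions, predictions))
    return hits / len(questions)


# Invented example item with placeholder drugs (not real benchmark data).
q = Question(
    patient_profile="placeholder profile",
    trajectory=["day 1: event", "day 2: event"],
    options={
        "A": Option("drug_a", "10 mg", "oral"),
        "B": Option("drug_a", "50 mg", "oral"),         # distractor: wrong dose
        "C": Option("drug_b", "1 g", "intravenous"),
    },
    correct=frozenset({"A", "C"}),
)

rate = exact_match_rate([q, q], [{"A", "C"}, {"A"}])  # 0.5
```

Under this assumed rule a partially correct selection such as `{"A"}` earns no credit, which is one plausible reason exact-match scores can stay low even for strong models.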
