RxSafeBench: Identifying Medication Safety Issues of Large Language Models in Simulated Consultation

Jiahao Zhao; Luxin Xu; Minghuan Tan; Lichao Zhang; Ahmadreza Argha; Hamid Alinejad-Rokny; Min Yang

arXiv:2511.04328·cs.AI·November 7, 2025

Open Access

TL;DR

This paper introduces RxSafeBench, a benchmark for evaluating the medication safety of large language models in simulated clinical consultations, a setting that existing healthcare AI evaluations leave largely unexplored.

Contribution

The paper builds a benchmark of realistic simulated consultations with embedded medication risks, backed by a dedicated medication safety database (RxRisk DB), and uses it to evaluate whether LLMs recommend safe medications, exposing their current limitations.

Findings

01. LLMs struggle with contraindication and interaction knowledge.

02. Risks are harder to detect when implied rather than explicit.

03. Benchmark enables systematic assessment of medication safety in LLMs.

Abstract

Numerous medical systems powered by Large Language Models (LLMs) have achieved remarkable progress in diverse healthcare tasks. However, research on their medication safety remains limited due to the lack of real-world datasets, which are constrained by privacy and accessibility issues. Moreover, the evaluation of LLMs in realistic clinical consultation settings, particularly regarding medication safety, is still underexplored. To address these gaps, we propose a framework that simulates and evaluates clinical consultations to systematically assess the medication safety capabilities of LLMs. Within this framework, we generate inquiry-diagnosis dialogues with embedded medication risks and construct a dedicated medication safety database, RxRisk DB, containing 6,725 contraindications, 28,781 drug interactions, and 14,906 indication-drug pairs. A two-stage filtering strategy ensures clinical realism and…
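As an illustration of how a database with these three kinds of entries might be queried, the sketch below checks a single recommended drug against contraindications, drug interactions, and indication-drug pairs. It is only a minimal sketch under stated assumptions: the class `RiskDB`, its field layout, the `check` method, and every entry such as `drug_a` or `condition_x` are hypothetical placeholders, not the schema or contents of RxRisk DB.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set, Tuple


@dataclass
class RiskDB:
    """Hypothetical in-memory stand-in for a medication safety database."""
    # (drug, patient condition) pairs where the drug should not be used
    contraindications: Set[Tuple[str, str]] = field(default_factory=set)
    # unordered pairs of drugs that interact
    interactions: Set[FrozenSet[str]] = field(default_factory=set)
    # (indication, drug) pairs where the drug is appropriate
    indications: Set[Tuple[str, str]] = field(default_factory=set)

    def check(self, drug: str, conditions: List[str],
              current_drugs: List[str], indication: str) -> List[str]:
        """Return a list of safety issues for one recommended drug."""
        issues = []
        for condition in conditions:
            if (drug, condition) in self.contraindications:
                issues.append(f"contraindicated with {condition}")
        for other in current_drugs:
            if frozenset((drug, other)) in self.interactions:
                issues.append(f"interacts with {other}")
        if (indication, drug) not in self.indications:
            issues.append(f"not indicated for {indication}")
        return issues


# Placeholder entries only; these are not real clinical facts.
db = RiskDB()
db.contraindications.add(("drug_a", "condition_x"))
db.interactions.add(frozenset(("drug_a", "drug_b")))
db.indications.add(("indication_1", "drug_a"))

issues = db.check("drug_a", ["condition_x"], ["drug_b"], "indication_1")
print(issues)
```

Storing interactions as unordered pairs means a lookup succeeds regardless of which drug is listed first. Detecting risks that a dialogue only implies would require more than a lookup of this kind.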

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling