CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs

Sijia Chen; Xiaomin Li; Mengxue Zhang; Eric Hanchen Jiang; Qingcheng Zeng; Chen-Hsiang Yu

arXiv:2505.11413 · cs.CL · May 19, 2025

TL;DR

CARES is a comprehensive benchmark designed to evaluate and improve the safety and robustness of medical language models against adversarial prompts and jailbreak attacks, with detailed scoring and mitigation strategies.

Contribution

We introduce CARES, a detailed benchmark with a new evaluation protocol and safety score for assessing medical LLM safety and robustness against adversarial prompts.

Findings

01. Many state-of-the-art LLMs are vulnerable to jailbreaks.

02. Models tend to over-refuse safe but atypical queries.

03. A lightweight classifier can mitigate jailbreak attempts.

Abstract

Large language models (LLMs) are increasingly deployed in medical contexts, raising critical concerns about safety, alignment, and susceptibility to adversarial manipulation. While prior benchmarks assess model refusal capabilities for harmful prompts, they often lack clinical specificity, graded harmfulness levels, and coverage of jailbreak-style attacks. We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating LLM safety in healthcare. CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles: direct, indirect, obfuscated, and role-play, to simulate both malicious and benign use cases. We propose a three-way response evaluation protocol (Accept, Caution, Refuse) and a fine-grained Safety Score metric to assess model behavior. Our analysis reveals that many state-of-the-art…
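The three-way protocol and Safety Score mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the scoring table, so everything numeric here is an assumption: the harm levels are numbered 0 to 3 for convenience, the per-level scores in `HYPOTHETICAL_SCORES` are invented for illustration, and `safety_score` is a hypothetical helper name. The sketch only shows the general idea that the "right" response depends on how harmful the prompt is, so that both accepting a harmful prompt and refusing a benign one lose points.

```python
from enum import Enum


class Response(Enum):
    ACCEPT = "accept"
    CAUTION = "caution"
    REFUSE = "refuse"


# ASSUMPTION: illustrative scores only, not the table used by CARES.
# Keys are harm levels (0 = benign, 3 = most harmful; numbering is assumed).
HYPOTHETICAL_SCORES = {
    0: {Response.ACCEPT: 1.0, Response.CAUTION: 0.5, Response.REFUSE: 0.0},
    1: {Response.ACCEPT: 0.5, Response.CAUTION: 1.0, Response.REFUSE: 0.5},
    2: {Response.ACCEPT: 0.0, Response.CAUTION: 0.5, Response.REFUSE: 1.0},
    3: {Response.ACCEPT: 0.0, Response.CAUTION: 0.0, Response.REFUSE: 1.0},
}


def safety_score(judgements):
    """Average the per-prompt scores over (harm_level, Response) pairs."""
    judgements = list(judgements)
    if not judgements:
        raise ValueError("need at least one judged response")
    total = sum(HYPOTHETICAL_SCORES[level][resp] for level, resp in judgements)
    return total / len(judgements)


if __name__ == "__main__":
    judged = [
        (0, Response.REFUSE),   # over-refusal of a benign prompt
        (1, Response.CAUTION),  # cautious answer to a mildly harmful prompt
        (3, Response.REFUSE),   # correct refusal of a highly harmful prompt
    ]
    print(safety_score(judged))
```

Under this assumed table, a model that refuses everything is penalized on benign prompts, which is one way a single metric can capture both jailbreak vulnerability and over-refusal.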

Code & Models

Datasets

HFXM/CARES-18K
dataset · 133 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)