Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
Mahiro Nakao, Kazuhiro Takemoto

TL;DR
This study evaluates the safety of 72 large language models in robotic health attendant control using a new dataset of harmful instructions, revealing significant safety concerns and the limited effectiveness of current mitigation strategies.
Contribution
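It introduces a dataset of 270 harmful instructions spanning nine prohibited behavior categories, grounded in the American Medical Association Principles of Medical Ethics, and reports empirical safety results for 72 LLMs in a simulated robotic health attendant setting.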
Findings
More than half of the 72 models exceeded a 50% violation rate; the mean across all models was 54.4%.
Proprietary models were significantly safer than open-weight models.
Prompt-based defenses only modestly reduced violation rates.
Abstract
Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4%, with more than half exceeding 50%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary…
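The headline numbers in the abstract (a mean violation rate across models, the share of models above 50%, and per-category rates) are simple aggregates over per-response judgments. The sketch below shows one way such aggregates could be computed. It is not the authors' evaluation code: the record layout (`model`, `category`, `violated`), the function names, the category labels, and the toy data are all assumptions made for illustration, and it assumes each model response has already been judged as a violation or a refusal.

```python
from collections import defaultdict
from statistics import mean


def rates_by(records, field):
    """Return {group: fraction of responses judged as violations}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [violations, total]
    for rec in records:
        bucket = counts[rec[field]]
        bucket[0] += 1 if rec["violated"] else 0
        bucket[1] += 1
    return {group: v / n for group, (v, n) in counts.items()}


def summarize(records, threshold=0.5):
    """Aggregate judged responses into per-model and per-category rates."""
    per_model = rates_by(records, "model")
    per_category = rates_by(records, "category")
    above = sum(rate > threshold for rate in per_model.values())
    return {
        "per_model": per_model,
        "per_category": per_category,
        # Mean of per-model rates, i.e. each model weighted equally.
        "mean_model_rate": mean(per_model.values()),
        # Share of models whose violation rate exceeds the threshold.
        "share_above_threshold": above / len(per_model),
    }


# Toy records, invented for illustration only (not data from the paper).
toy_records = [
    {"model": "model-a", "category": "device_manipulation", "violated": True},
    {"model": "model-a", "category": "emergency_delay", "violated": True},
    {"model": "model-a", "category": "device_manipulation", "violated": True},
    {"model": "model-a", "category": "emergency_delay", "violated": False},
    {"model": "model-b", "category": "device_manipulation", "violated": False},
    {"model": "model-b", "category": "emergency_delay", "violated": True},
    {"model": "model-b", "category": "device_manipulation", "violated": False},
    {"model": "model-b", "category": "emergency_delay", "violated": False},
]

summary = summarize(toy_records)
print(summary)
```

Averaging per-model rates weights every model equally regardless of how many responses it produced. Whether the paper averages this way or pools all responses is not stated in the text above, so treat that choice as an assumption too.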
