MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems
Kai Chen, Taihang Zhen, Hewei Wang, Kailai Liu, Xinfeng Li, Jing Huo, Tianpei Yang, Jinfeng Xu, Wei Dong, Yang Gao

TL;DR
MedSentry provides a comprehensive benchmark and evaluation framework for assessing and improving the safety of multi-agent medical LLM systems against adversarial threats, highlighting architecture vulnerabilities and proposing mitigation strategies.
Contribution
Introduces MedSentry, a novel benchmark and evaluation pipeline for analyzing safety risks in multi-agent medical LLM systems, along with detection and correction methods for malicious agents.
Findings
SharedPool is highly susceptible to attacks due to open information sharing.
Decentralized architectures show greater resilience against adversarial attacks.
The proposed mitigation restores system safety to near-baseline levels.
Abstract
As large language models (LLMs) are increasingly deployed in healthcare, ensuring their safety, particularly within collaborative multi-agent configurations, is paramount. In this paper, we introduce MedSentry, a benchmark comprising 5,000 adversarial medical prompts spanning 25 threat categories with 100 subthemes. Coupled with this dataset, we develop an end-to-end attack-defense evaluation pipeline to systematically analyze how four representative multi-agent topologies (Layers, SharedPool, Centralized, and Decentralized) withstand attacks from 'dark-personality' agents. Our findings reveal critical differences in how these architectures handle information contamination and maintain robust decision-making, exposing their underlying vulnerability mechanisms. For instance, SharedPool's open information sharing makes it highly susceptible, whereas Decentralized architectures exhibit…
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper focuses on multi-agent systems, which are increasingly deployed in healthcare but remain underexplored from a security perspective. This is a valuable contribution distinct from single-model medical AI benchmarks.
2. The paper is a systematic investigation of topological vulnerabilities in medical multi-agent systems. The finding that topology choice significantly impacts safety is actionable for system designers.
3. Benchmark scale and organization appear comprehensive in the
1. The paper introduces "dark-personality" adversarial agents without a reasonable threat model. Dark Triad and other psychometrics are human constructs with questionable applicability to AI agents. This fundamental conceptual issue undermines the entire PCDC and evaluation framework.
2. The paper needs component-wise analysis to understand what drives the results. The paper does not do ablation studies on components of PCDC. Regarding "the defense pipeline as an integrated end to end process"
- The paper tackles an important and underexplored problem: evaluating the safety of medical LLM agents in a multi-agent setting.
- The benchmark design is comprehensive, encompassing multiple medical stages and diverse risk categories, which makes it representative.
- Extensive experiments analyze a wide range of factors that could affect the risk of the multi-agent system.
- The description of the attack instruction process in *Coarse-Grained Data Generation* is vague. It is unclear how the process is iterative and how "topics and subtopics are substituted" across iterations. More explicit examples or algorithmic details would improve clarity.
- The evaluation design may introduce confounding factors across different topologies. As mentioned in Lines 209–210, the evaluator agent receives different amounts of information under different setups. This raises the conc
1. Investigating the resilience of various multi-agent architectures to adversarial prompts is a valuable and relevant direction, offering insights into the robustness and design trade-offs of safety-critical LLM systems.
1. Existing safety benchmarks, like MedSafetyBench, in the medical domain already address various risks, and it is unclear how the proposed benchmark distinguishes itself, aside from the inclusion of a malicious agent.
2. The evaluated scenarios lack practical relevance, as the likelihood of inserting a malicious agent into real-world medical systems is very low.
3. The study does not employ advanced attack methods like [1, 2], resulting in prompts that are not sufficiently adversarial to meani
Taxonomy
Topics: Electronic Health Records Systems · Biomedical Text Mining and Ontologies · Semantic Web and Ontologies
