Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs
Sohely Jahan, Ruimin Sun

TL;DR
This paper demonstrates that black-box distillation attacks can replicate medical LLMs' reasoning while removing safety features, exposing significant risks in clinical AI deployment.
Contribution
It introduces a black-box distillation method to replicate medical LLMs and reveals their safety vulnerabilities without access to model internals.
Findings
Surrogate models achieve high fidelity on benign inputs.
The surrogate produces unsafe outputs for 86% of adversarial prompts.
Black-box distillation exposes safety risks at low cost.
Abstract
As medical large language models (LLMs) become increasingly integrated into clinical workflows, concerns around alignment robustness and safety are escalating. Prior work on model extraction has focused on classification models or memorization leakage, leaving the vulnerability of safety-aligned generative medical LLMs underexplored. We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs using only output-level access. By issuing 48,000 instruction queries to Meditron-7B and collecting 25,000 benign instruction-response pairs, we fine-tune a LLaMA3 8B surrogate via parameter-efficient LoRA under a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. With a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86%…
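To put the "low cost" claim in perspective, the figures quoted in the abstract can be turned into per-query numbers. The sketch below is only back-of-the-envelope arithmetic: it assumes the $12 is the total cost and spreads it evenly over the queries, which the abstract does not state. The function name budget_summary and its keys are hypothetical, chosen for illustration.

```python
# Figures quoted in the abstract.
QUERIES_ISSUED = 48_000
BENIGN_PAIRS_KEPT = 25_000
# Assumption: the $12 is treated as the total cost, spread evenly over queries.
REPORTED_COST_USD = 12.0


def budget_summary(queries: int, pairs_kept: int, cost_usd: float) -> dict:
    """Return simple ratios derived from the reported counts and cost."""
    return {
        "yield": pairs_kept / queries,          # share of queries kept as pairs
        "usd_per_query": cost_usd / queries,    # cost per issued query
        "usd_per_pair": cost_usd / pairs_kept,  # cost per retained pair
    }


summary = budget_summary(QUERIES_ISSUED, BENIGN_PAIRS_KEPT, REPORTED_COST_USD)
for name, value in summary.items():
    print(f"{name}: {value:.5f}")
```

Under these assumptions, roughly 52% of the issued queries end up as retained benign pairs, at about $0.00025 per query.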
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
