LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks
Qingzhao Zhang, Ziyang Xiong, Z. Morley Mao

TL;DR
This paper identifies a new security threat: attackers can exploit false positives in LLM safeguards to block legitimate user requests, causing a denial of service. The result highlights the need for more robust safety mechanisms.
Contribution
The study shows that false positives in LLM safeguards can be exploited for DoS attacks, introduces attack methods that trigger them, and argues that safeguards must be made robust against this threat.
Findings
Adversarial prompts can block over 97% of requests on Llama Guard 3.
Attack methods include prompt insertion and poisoned fine-tuning.
False positives pose a significant security risk for LLM safety mechanisms.
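To make the blocking-rate figure above concrete, here is a minimal sketch of how such a rate could be measured: the fraction of benign requests that a safeguard rejects, with and without extra text in the prompt template. Everything in it is an assumption for illustration. `toy_guard` is a hypothetical keyword filter standing in for a real safeguard model such as Llama Guard 3, `block_rate` is a hypothetical helper, and the placeholder string `[flagged-token]` is not an adversarial prompt from the paper.

```python
from typing import Callable, Iterable


def toy_guard(text: str) -> bool:
    """Hypothetical stand-in for a safeguard model: True means 'blocked'."""
    flagged_terms = {"flagged-token"}  # illustrative placeholder, not from the paper
    lowered = text.lower()
    return any(term in lowered for term in flagged_terms)


def block_rate(
    guard: Callable[[str], bool],
    prompts: Iterable[str],
    template: str = "{prompt}",
) -> float:
    """Fraction of prompts the guard blocks after they are placed in the template."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    blocked = sum(guard(template.format(prompt=p)) for p in prompts)
    return blocked / len(prompts)


benign_prompts = [
    "Summarize this article about gardening.",
    "Translate 'good morning' into French.",
    "What is the capital of Canada?",
]

# Baseline: benign prompts pass through an untouched template.
clean_rate = block_rate(toy_guard, benign_prompts)

# Tampered template: the same benign prompts now carry an extra string,
# so every request is rejected even though the user's content is safe.
tampered_rate = block_rate(
    toy_guard, benign_prompts, template="{prompt}\n[flagged-token]"
)

print(f"clean: {clean_rate:.0%}, tampered: {tampered_rate:.0%}")
```

The gap between the two rates is the false-positive rate attributable to the template change. With a real safeguard model the guard function would wrap a model call, and the rate would be estimated over a much larger benign prompt set.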
Abstract
Safety is a paramount concern for large language models (LLMs) in open deployment, motivating the development of safeguard methods that enforce ethical and responsible use through safety alignment or guardrail mechanisms. Jailbreak attacks that exploit the false negatives of safeguard methods have emerged as a prominent research focus in the field of LLM security. However, we found that malicious attackers could also exploit false positives of safeguards, i.e., fool the safeguard model into mistakenly blocking safe content, leading to a denial-of-service (DoS) affecting LLM users. To bridge the knowledge gap of this overlooked threat, we explore multiple attack methods that include inserting a short adversarial prompt into user prompt templates and corrupting the LLM on the server by poisoned fine-tuning. In both cases, the attack triggers safeguard rejections of user requests…
Taxonomy
Topics: Privacy-Preserving Technologies in Data
Methods: Focus · LLaMA
