LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks

Qingzhao Zhang; Ziyang Xiong; Z. Morley Mao

arXiv:2410.02916·cs.CR·April 10, 2025

Open Access

TL;DR

This paper uncovers a new security threat where attackers exploit false positives in LLM safeguards, causing denial-of-service by blocking legitimate user requests, highlighting the need for improved robustness of safety mechanisms.

Contribution

The study reveals that false positives in LLM safeguards can be exploited for DoS attacks, introducing novel attack methods and emphasizing the importance of robustness against such threats.

Findings

01 Adversarial prompts can block over 97% of requests on Llama Guard 3.

02 Attack methods include prompt insertion and poisoned fine-tuning.

03 False positives pose a significant security risk for LLM safety mechanisms.
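
To make the block-rate figure above concrete, the following is a minimal sketch of how such a rate could be measured: the fraction of benign requests a safeguard rejects once a prompt template has been tampered with. Everything here is an assumption for illustration. `toy_guard` is a hypothetical keyword check standing in for a learned safeguard model, the `TRIGGER` marker is a placeholder rather than an adversarial prompt from the paper, and the templates and request list are invented.

```python
from typing import Callable, Iterable


def toy_guard(prompt: str) -> bool:
    """Hypothetical stand-in safeguard: True means the prompt is blocked.

    A real safeguard such as Llama Guard 3 is a learned classifier,
    not a keyword match; this only mimics a false positive.
    """
    return "TRIGGER" in prompt


def render(template: str, request: str) -> str:
    """Fill the user's request into the prompt template."""
    return template.format(request=request)


def block_rate(
    guard: Callable[[str], bool],
    template: str,
    requests: Iterable[str],
) -> float:
    """Fraction of requests the guard rejects when rendered with the template."""
    requests = list(requests)
    if not requests:
        raise ValueError("at least one request is required")
    blocked = sum(guard(render(template, r)) for r in requests)
    return blocked / len(requests)


# Invented benign requests; none of them should be blocked on their own.
BENIGN = [
    "Summarize this article.",
    "Translate 'hello' to French.",
    "What is 2 + 2?",
    "Write a haiku about rain.",
]

CLEAN_TEMPLATE = "You are a helpful assistant.\nUser: {request}"
# Same template with a placeholder string inserted by the (hypothetical) attacker.
TAMPERED_TEMPLATE = "You are a helpful assistant. TRIGGER\nUser: {request}"

clean_rate = block_rate(toy_guard, CLEAN_TEMPLATE, BENIGN)
tampered_rate = block_rate(toy_guard, TAMPERED_TEMPLATE, BENIGN)
print(f"clean: {clean_rate:.0%}, tampered: {tampered_rate:.0%}")
```

The same `block_rate` helper could be pointed at a real safeguard by swapping `toy_guard` for a function that calls it, which is also a convenient way to track false-positive robustness as a regression metric.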

Abstract

Safety is a paramount concern for large language models (LLMs) in open deployment, motivating the development of safeguard methods that enforce ethical and responsible use through safety alignment or guardrail mechanisms. Jailbreak attacks that exploit the false negatives of safeguard methods have emerged as a prominent research focus in the field of LLM security. However, we found that malicious attackers could also exploit the false positives of safeguards, i.e., fool the safeguard model into mistakenly blocking safe content, leading to a denial-of-service (DoS) affecting LLM users. To bridge the knowledge gap on this overlooked threat, we explore multiple attack methods, including inserting a short adversarial prompt into user prompt templates and corrupting the LLM on the server by poisoned fine-tuning. In both ways, the attack triggers safeguard rejections of user requests…

Taxonomy

Topics: Privacy-Preserving Technologies in Data

Methods: Focus · LLaMA