Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously
Andrew Adiletta, Kathryn Adiletta, Kemal Derya, Berk Sunar

TL;DR
This paper introduces Super Suffixes, adversarial suffixes that bypass both the alignment of a text generation model and its guard model at the same time, across multiple LLMs. It also proposes DeltaGuard, a detection technique based on internal state analysis, to improve security against such adversarial prompts.
Contribution
It demonstrates the first successful bypass of Llama Prompt Guard 2 using joint optimization and introduces DeltaGuard, a lightweight detection method based on internal state similarity analysis.
Findings
Super Suffixes can bypass multiple guard models effectively.
DeltaGuard detects Super Suffix attacks with nearly 100% accuracy.
Joint optimization reveals vulnerabilities in existing guard models.
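As a rough illustration of what "internal state similarity analysis" might involve, the sketch below measures how far a sequence of internal state vectors drifts relative to a fixed reference direction, using cosine similarity. This is not the paper's DeltaGuard procedure, which this summary does not describe. The function names, the toy vectors, the reference direction, and the threshold are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_trace(states, reference):
    """Similarity of each internal state vector to a reference direction."""
    return [cosine_similarity(state, reference) for state in states]


def flag_by_similarity_shift(states, reference, threshold):
    """Hypothetical rule: flag an input when the similarity to the
    reference direction swings by more than `threshold` across states."""
    trace = similarity_trace(states, reference)
    return (max(trace) - min(trace)) > threshold


# Toy 2-D vectors standing in for real hidden states (assumption).
reference = [1.0, 0.0]
steady_states = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
drifting_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

steady_flag = flag_by_similarity_shift(steady_states, reference, threshold=0.5)
drifting_flag = flag_by_similarity_shift(drifting_states, reference, threshold=0.5)
```

In a real setting the vectors would come from a model's hidden layers, and the reference direction and threshold would have to be derived from data; none of those choices are specified in this summary.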
Abstract
The rapid deployment of Large Language Models (LLMs) has created an urgent need for enhanced security and privacy measures in Machine Learning (ML). LLMs are increasingly being used to process untrusted text inputs and even generate executable code, often while having access to sensitive system controls. To address these security concerns, several companies have introduced guard models, which are smaller, specialized models designed to protect text generation models from adversarial or malicious inputs. In this work, we advance the study of adversarial inputs by introducing Super Suffixes, suffixes capable of overriding multiple alignment objectives across various models with different tokenization schemes. We demonstrate their effectiveness, along with our joint optimization technique, by successfully bypassing the protection mechanisms of Llama Prompt Guard 2 on five different text…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Spam and Phishing Detection
