Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously

Andrew Adiletta; Kathryn Adiletta; Kemal Derya; Berk Sunar

arXiv:2512.11783·cs.CR·December 15, 2025

Open Access

TL;DR

This paper introduces Super Suffixes, adversarial suffixes that bypass both text generation alignment and guard models across multiple LLMs, and proposes DeltaGuard, a detection technique based on internal state analysis, to improve security against such adversarial prompts.

Contribution

The paper demonstrates the first successful bypass of Llama Prompt Guard 2, achieved through joint optimization, and introduces DeltaGuard, a lightweight detection method based on internal state similarity analysis.
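
The summary above does not spell out how DeltaGuard is computed, so the following is only a generic illustration of what a similarity comparison over internal state vectors can look like, not the authors' method. The function names, the reference vector, the toy two-dimensional states, and the threshold of 0.5 are all assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_shift(state_before, state_after, reference):
    """Change in similarity to `reference` between two internal states.

    Hypothetical helper: `state_before` and `state_after` stand in for
    internal state vectors taken from a model at two points, and
    `reference` stands in for some fixed direction of interest.
    """
    before = cosine_similarity(state_before, reference)
    after = cosine_similarity(state_after, reference)
    return after - before


def is_suspicious(state_before, state_after, reference, threshold=0.5):
    """Flag an input when the similarity moves by more than `threshold`.

    The decision rule and the default threshold are assumptions chosen
    for illustration only.
    """
    shift = similarity_shift(state_before, state_after, reference)
    return abs(shift) > threshold


# Toy two-dimensional vectors; real internal states would be far larger.
reference = [1.0, 0.0]
benign_before, benign_after = [1.0, 0.1], [1.0, 0.2]
shifted_before, shifted_after = [1.0, 0.1], [-1.0, 0.1]

benign_flag = is_suspicious(benign_before, benign_after, reference)
shifted_flag = is_suspicious(shifted_before, shifted_after, reference)
```

In this toy setup the benign pair barely moves relative to the reference direction, while the second pair flips to the opposite side of it, so only the second pair is flagged. A practical detector would need states extracted from an actual model and a threshold calibrated on labeled data.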

Findings

01. Super Suffixes can bypass multiple guard models effectively.

02. DeltaGuard detects Super Suffix attacks with nearly 100% accuracy.

03. Joint optimization reveals vulnerabilities in existing guard models.

Abstract

The rapid deployment of Large Language Models (LLMs) has created an urgent need for enhanced security and privacy measures in Machine Learning (ML). LLMs are increasingly being used to process untrusted text inputs and even generate executable code, often while having access to sensitive system controls. To address these security concerns, several companies have introduced guard models, which are smaller, specialized models designed to protect text generation models from adversarial or malicious inputs. In this work, we advance the study of adversarial inputs by introducing Super Suffixes, suffixes capable of overriding multiple alignment objectives across various models with different tokenization schemes. We demonstrate their effectiveness, along with our joint optimization technique, by successfully bypassing the protection mechanisms of Llama Prompt Guard 2 on five different text…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Spam and Phishing Detection