TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking
Mengyao Du, Han Fang, Haokai Ma, Gang Yang, Quanjun Yin, Shouling Ji, and Ee-Chien Chang

TL;DR
TrapSuffix is a proactive defense that fine-tunes a language model so that suffix-based jailbreak attacks either fail or leave traceable fingerprints, without slowing inference or requiring significant additional memory.
Contribution
The paper introduces a lightweight fine-tuning approach that steers attackers into defender-designed optimization traps and makes any adversarial suffix that still succeeds traceable, in contrast to defenses that rely on passive detection.
Findings
Reduces attack success rate to below 0.01%
Achieves 87.9% traceability of adversarial suffixes
Requires only 15.87 MB additional memory
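As a reading aid for the first two findings, the sketch below shows how rates of this kind are commonly computed: as the fraction of trials with a positive outcome. It is a minimal illustration only. The helper `rate`, the variable names, and the toy outcomes are hypothetical, and the paper's exact evaluation protocol (for example, which suffixes form the denominator of the traceability figure) is not given in this summary.

```python
def rate(flags):
    """Return the fraction of True entries in a sequence of booleans."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one trial")
    return sum(flags) / len(flags)


# Hypothetical toy outcomes, NOT data from the paper.
# One entry per attack attempt: did the jailbreak succeed?
attack_succeeded = [False, False, True, False]
# One entry per adversarial suffix: was it attributed to a trap fingerprint?
# (Assumed definition of "traceable"; the paper may define it differently.)
suffix_traced = [True, True, False, True]

asr = rate(attack_succeeded)
traceability = rate(suffix_traced)
print(f"ASR: {asr:.2%}, traceability: {traceability:.2%}")
```

Under this reading, a reported attack success rate below 0.01% would correspond to fewer than one successful attempt per 10,000 trials.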
Abstract
Suffix-based jailbreak attacks append an adversarial suffix, i.e., a short token sequence, to steer aligned LLMs into unsafe outputs. Since suffixes are free-form text, they admit endlessly many surface forms, making jailbreak mitigation difficult. Most existing defenses depend on passive detection of suspicious suffixes, without leveraging the defender's inherent asymmetric ability to inject secrets and proactively conceal gaps. Motivated by this, we take a controllability-oriented perspective and develop a proactive defense that nudges attackers into a no-win dilemma: either they fall into defender-designed optimization traps and fail to produce an effective adversarial suffix, or they can succeed only by generating adversarial suffixes that carry distinctive, traceable fingerprints. We propose TrapSuffix, a lightweight fine-tuning approach that injects trap-aligned behaviors into the…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
