ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings
Hao Wang, Hao Li, Minlie Huang, Lei Sha

TL;DR
This paper introduces ASETF, a new method for creating effective, coherent adversarial suffixes to attack large language models, improving success rates and efficiency over previous techniques.
Contribution
The paper presents ASETF, a novel framework that transforms continuous adversarial embeddings into understandable text, reducing computational costs and enhancing attack success on various LLMs.
Findings
Significantly reduces attack computation time
Achieves higher attack success rates than existing methods
Generates transferable adversarial suffixes for multiple LLMs
Abstract
The safety defense methods of Large language models(LLMs) stays limited because the dangerous prompts are manually curated to just few known attack types, which fails to keep pace with emerging varieties. Recent studies found that attaching suffixes to harmful instructions can hack the defense of LLMs and lead to dangerous outputs. However, similar to traditional text adversarial attacks, this approach, while effective, is limited by the challenge of the discrete tokens. This gradient based discrete optimization attack requires over 100,000 LLM calls, and due to the unreadable of adversarial suffixes, it can be relatively easily penetrated by common defense methods such as perplexity filters. To cope with this challenge, in this paper, we proposes an Adversarial Suffix Embedding Translation Framework (ASETF), aimed at transforming continuous adversarial suffix embeddings into coherent…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Adversarial Robustness in Machine Learning
