ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings

Hao Wang; Hao Li; Minlie Huang; Lei Sha

arXiv:2402.16006·cs.CL·June 5, 2024·1 cites

Open Access

TL;DR

This paper introduces ASETF, a new method for creating effective, coherent adversarial suffixes to attack large language models, improving success rates and efficiency over previous techniques.

Contribution

The paper presents ASETF, a novel framework that transforms continuous adversarial embeddings into understandable text, reducing computational costs and enhancing attack success on various LLMs.

Findings

01. Significantly reduces attack computation time
02. Achieves higher attack success rates than existing methods
03. Generates transferable adversarial suffixes for multiple LLMs

Abstract

Safety defenses for large language models (LLMs) remain limited because the dangerous prompts used to build them are manually curated to cover only a few known attack types, which fails to keep pace with emerging varieties. Recent studies found that attaching suffixes to harmful instructions can bypass the defenses of LLMs and lead to dangerous outputs. However, as with traditional textual adversarial attacks, this approach, while effective, is limited by the difficulty of optimizing over discrete tokens. The gradient-based discrete optimization attack requires over 100,000 LLM calls, and because the adversarial suffixes are unreadable, they are relatively easily caught by common defense methods such as perplexity filters. To address this challenge, we propose an Adversarial Suffix Embedding Translation Framework (ASETF), which aims to transform continuous adversarial suffix embeddings into coherent…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Adversarial Robustness in Machine Learning