Target-driven Attack for Large Language Models

Chong Zhang; Mingyu Jin; Dong Shu; Taowen Wang; Dongfang Liu; Xiaobo Jin

arXiv:2411.07268·cs.CL·November 14, 2024

TL;DR

This paper introduces a target-driven black-box attack method for large language models that maximizes divergence between clean and attacked texts, improving attack success and revealing security vulnerabilities.

Contribution

It proposes a target-driven attack approach based on convex optimization and gradient descent, designed for security testing of LLMs in the black-box setting.

Findings

01 Effective attack success across multiple LLMs and datasets

02 Outperforms heuristic black-box attack strategies

03 Highlights security vulnerabilities in current LLMs

Abstract

Current large language models (LLMs) provide a strong foundation for large-scale, user-oriented natural language tasks. Users can easily inject adversarial text or instructions through the user interface, which creates security challenges such as the model failing to give the correct answer. Although there is a large body of research on black-box attacks, most of these attacks rely on random and heuristic strategies. It is unclear how these strategies relate to attack success rates, and therefore how they can be used to improve model robustness. To address this, we propose a target-driven black-box attack method that redefines the attack's goal as maximizing the KL divergence between the conditional probabilities of the clean text and the attack text. Based on this goal, we transform the distance maximization problem into two convex optimization problems to…
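To make the stated objective concrete, here is a minimal sketch of the KL divergence between two discrete output distributions. The function name `kl_divergence`, the `eps` guard, the direction KL(clean || attacked), and the toy three-token distributions are all illustrative assumptions, not taken from the paper. The sketch only evaluates the divergence for fixed distributions; it does not reproduce the paper's convex optimization procedure.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions given as lists.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps
    (an assumed guard) to avoid division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical next-token distributions over a 3-token vocabulary.
clean = [0.7, 0.2, 0.1]          # model output for the clean text (assumed)
mild_attack = [0.6, 0.3, 0.1]    # output after a weak perturbation (assumed)
strong_attack = [0.1, 0.2, 0.7]  # output after a strong perturbation (assumed)

kl_mild = kl_divergence(clean, mild_attack)
kl_strong = kl_divergence(clean, strong_attack)
```

Under this reading, an attack text that pushes the output distribution further from the clean one yields a larger divergence, so `kl_strong` exceeds `kl_mild` for the toy values above.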
