Target-driven Attack for Large Language Models
Chong Zhang, Mingyu Jin, Dong Shu, Taowen Wang, Dongfang Liu, Xiaobo Jin

TL;DR
This paper introduces a target-driven black-box attack method for large language models that maximizes divergence between clean and attacked texts, improving attack success and revealing security vulnerabilities.
Contribution
It proposes a novel target-driven attack approach using convex optimization and gradient descent, specifically designed for black-box LLM security testing.
Findings
Effective attack success across multiple LLMs and datasets
Outperforms heuristic black-box attack strategies
Highlights security vulnerabilities in current LLMs
Abstract
Current large language models (LLMs) provide a strong foundation for large-scale user-oriented natural language tasks. Many users can easily inject adversarial text or instructions through the user interface, creating security challenges such as the model failing to give the correct answer. Although there is a large body of research on black-box attacks, most of these attacks rely on random and heuristic strategies. It is unclear how these strategies relate to the attack success rate, and thus how they can effectively improve model robustness. To solve this problem, we propose a target-driven black-box attack method that redefines the attack's goal as maximizing the KL divergence between the conditional probabilities of the clean text and the attack text. Based on this attack goal, we transform the distance maximization problem into two convex optimization problems to…
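To make the attack goal concrete, the following is a minimal sketch of scoring candidate attack texts by the KL divergence between the model's output distribution on the clean text and on each attacked text. It is an illustration only, not the paper's convex-optimization procedure: the function `kl_divergence`, the three-answer distributions, and the direction KL(clean || attack) are all assumptions made for the example, and a real black-box setting may not expose these probabilities directly.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # Clamp q to eps so a zero probability does not divide by zero.
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical output distributions over three candidate answers.
p_clean = [0.7, 0.2, 0.1]     # model output given the clean text
p_attack_a = [0.6, 0.3, 0.1]  # model output given candidate attack text A
p_attack_b = [0.1, 0.2, 0.7]  # model output given candidate attack text B

# Score each candidate by how far it pushes the output from the clean one.
scores = {
    "A": kl_divergence(p_clean, p_attack_a),
    "B": kl_divergence(p_clean, p_attack_b),
}
best = max(scores, key=scores.get)
print(best, scores)
```

Under this objective, the candidate that shifts the most probability mass away from the clean answer receives the highest score, which is the sense in which the attack is driven by a target rather than by random or heuristic perturbation.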
