Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned   Large Language Models

Xiao Liu; Liangzhi Li; Tong Xiang; Fuying Ye; Lu Wei; Wangyue Li; Noa; Garcia

arXiv:2407.15399·cs.CL·July 23, 2024

Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

Xiao Liu, Liangzhi Li, Tong Xiang, Fuying Ye, Lu Wei, Wangyue Li, Noa, Garcia

PDF

Open Access

TL;DR

This paper presents a novel adversarial attack method on large language models that uses human-like conversation strategies to extract harmful information, surpassing previous attack techniques in effectiveness.

Contribution

It introduces a new attack approach exploiting conversational tactics to reveal malicious intents in LLM responses, highlighting a significant security concern.

Findings

01

Effective on GPT-3.5-turbo, GPT-4, and Llama2

02

Outperforms conventional attack methods

03

Raises questions about detecting malicious intent

Abstract

With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. Through our…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices

MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Cosine Annealing · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Linear Warmup With Cosine Annealing · Residual Connection · Dropout · Transformer