Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
Xiao Liu, Liangzhi Li, Tong Xiang, Fuying Ye, Lu Wei, Wangyue Li, Noa Garcia

TL;DR
This paper presents a novel adversarial attack method on large language models that uses human-like conversation strategies to extract harmful information, surpassing previous attack techniques in effectiveness.
Contribution
It introduces a new attack approach that uses conversational tactics to conceal malicious intent while extracting harmful information from LLM responses, highlighting a significant security concern.
Findings
Effective on GPT-3.5-turbo, GPT-4, and Llama2
Outperforms conventional attack methods
Raises questions about detecting malicious intent
Abstract
With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. Through our…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices
Methods: Attention Is All You Need · Cosine Annealing · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Linear Warmup With Cosine Annealing · Residual Connection · Dropout · Transformer
