Vocabulary Attack to Hijack Large Language Model Applications

Patrick Levi; Christoph P. Neumann

arXiv:2404.02637·cs.CR·May 31, 2024·2 cites

Open Access

TL;DR

This paper introduces a vocabulary-based attack that hijacks the behavior of large language models by inserting specific words from the model vocabulary into an instruction. The resulting instructions are inconspicuous and hard to detect, and the words can be found with a different model than the one being attacked.

Contribution

The paper presents a vocabulary insertion attack that selects words through an optimization procedure using embeddings from a separate attacker LLM, and demonstrates it by goal hijacking two open-source LLMs from the Llama2 and Flan-T5 families.

Findings

01. Inconspicuous instructions are hard to detect.

02. Single word insertion can be sufficient for successful attacks.

03. Cross-model attack capability demonstrated.

Abstract

Rapid advances in Large Language Models (LLMs) are driving an increasing number of applications. Along with the growing number of users, we also see an increasing number of attackers who try to outsmart these systems. They want the model to reveal confidential information, state specific false information, or exhibit offensive behavior. To this end, they manipulate their instructions for the LLM by inserting separators or rephrasing them systematically until they reach their goal. Our approach is different. It inserts words from the model vocabulary. We find these words using an optimization procedure and embeddings from another LLM (the attacker LLM). We demonstrate our approach by goal hijacking two popular open-source LLMs from the Llama2 and the Flan-T5 families, respectively. We present two main findings. First, our approach creates inconspicuous instructions and is therefore hard to detect.…
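The abstract does not spell out the optimization procedure, so the following is only a minimal sketch of the general idea of searching for a single word insertion, not the authors' method. It assumes an exhaustive search over words and positions and a cosine-similarity score between an embedding of the target's output and an embedding of the attacker's goal text. All names (`best_single_insertion`, `toy_embed`, `toy_target`) are hypothetical, and the embedding and target functions are toy stand-ins for an attacker LLM and a target application.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_single_insertion(
    prompt: str,
    vocabulary: Sequence[str],
    target: Callable[[str], str],
    embed: Callable[[str], Vector],
    goal: str,
) -> Tuple[str, float]:
    """Try every (word, position) pair and keep the prompt whose output
    embedding is closest to the goal embedding. Assumed search strategy."""
    goal_vec = embed(goal)
    words = prompt.split()
    best_prompt = prompt
    best_score = cosine(embed(target(prompt)), goal_vec)
    for word in vocabulary:
        for pos in range(len(words) + 1):
            candidate = " ".join(words[:pos] + [word] + words[pos:])
            score = cosine(embed(target(candidate)), goal_vec)
            if score > best_score:  # strict: first best candidate wins
                best_prompt, best_score = candidate, score
    return best_prompt, best_score


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for attacker-LLM embeddings: letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def toy_target(prompt: str) -> str:
    """Hypothetical stand-in for the target application, not a real LLM."""
    return "blue" if "sky" in prompt.split() else "green"


candidate, score = best_single_insertion(
    "name a colour", ["tree", "sky", "car"], toy_target, toy_embed, goal="blue"
)
print(candidate, round(score, 3))
```

In this toy setup only one vocabulary word changes the stub's output, so the search settles on a single inserted word. A real setting would replace both stand-ins and would likely need a far cheaper search than exhaustive enumeration.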

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Flan-T5