Vocabulary Attack to Hijack Large Language Model Applications
Patrick Levi, Christoph P. Neumann

TL;DR
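This paper introduces a vocabulary-based attack that hijacks the behavior of large language models by inserting specific words from the model vocabulary into the prompt. The resulting instructions are inconspicuous and hard to detect, and the attack works even when the words are selected with a different model than the one being attacked.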
Contribution
It presents a novel vocabulary insertion attack that selects words using an optimization procedure and embeddings from a separate attacker model, and demonstrates its effectiveness by goal hijacking two popular open-source LLMs from the Llama2 and Flan-T5 families.
Findings
Inconspicuous instructions are hard to detect.
Single word insertion can be sufficient for successful attacks.
Cross-model attack capability demonstrated.
Abstract
The rapid advances in Large Language Models (LLMs) are driving an increasing number of applications. Along with the growing number of users, we also see an increasing number of attackers who try to outsmart these systems. They want the model to reveal confidential information, produce specific false information, or exhibit offensive behavior. To this end, they manipulate their instructions for the LLM by inserting separators or rephrasing them systematically until they reach their goal. Our approach is different. It inserts words from the model vocabulary. We find these words using an optimization procedure and embeddings from another LLM (attacker LLM). We demonstrate our approach by goal hijacking two popular open-source LLMs from the Llama2 and the Flan-T5 families, respectively. We present two main findings. First, our approach creates inconspicuous instructions and is therefore hard to detect.…
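The idea of searching the vocabulary for a word to insert can be illustrated with a toy sketch. This is not the authors' procedure: the summary does not specify their objective or optimizer, so everything below is an assumption made for illustration. The embedding table TOY_EMBEDDINGS, the mean-pooled cosine score, and the function best_single_insertion are all hypothetical stand-ins for embeddings taken from an attacker LLM and for the paper's optimization step.

```python
import math

# Hypothetical stand-in for attacker-LLM embeddings (hand-made 3-d vectors).
TOY_EMBEDDINGS = {
    "translate": [1.0, 0.0, 0.0],
    "this": [0.2, 0.2, 0.0],
    "sentence": [0.6, 0.1, 0.0],
    "weather": [0.0, 1.0, 0.0],
    "forecast": [0.1, 0.9, 0.1],
    "banana": [0.0, 0.0, 1.0],
}


def embed(words):
    """Mean-pool the toy word vectors into a single text vector."""
    total = [0.0, 0.0, 0.0]
    for word in words:
        for i, value in enumerate(TOY_EMBEDDINGS[word]):
            total[i] += value
    return [value / len(words) for value in total]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_single_insertion(prompt, vocabulary, target):
    """Exhaustively try every (word, position) pair for one inserted word.

    Returns (score, word, position, candidate). If no insertion beats the
    unmodified prompt, word and position are None.
    Assumption: the score is similarity to a target text in embedding space.
    Mean pooling ignores word order, so position barely matters in this toy.
    """
    target_vec = embed(target)
    best = (cosine(embed(prompt), target_vec), None, None, list(prompt))
    for word in vocabulary:
        for pos in range(len(prompt) + 1):
            candidate = prompt[:pos] + [word] + prompt[pos:]
            score = cosine(embed(candidate), target_vec)
            if score > best[0]:
                best = (score, word, pos, candidate)
    return best


if __name__ == "__main__":
    result = best_single_insertion(
        ["translate", "this", "sentence"],
        ["weather", "forecast", "banana"],
        ["weather", "forecast"],
    )
    print(result)
```

A real study would replace the toy table with embeddings from an actual model and would evaluate the modified prompt against the target LLM's output rather than against a pooled embedding. The sketch only shows the shape of a single-word search over vocabulary and position.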
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Flan-T5
