Hijacking Large Language Models via Adversarial In-Context Learning

Xiangyu Zhou; Yao Qiang; Saleh Zare Zade; Prashant Khanduri; Dongxiao Zhu

arXiv:2311.09948 · cs.LG · May 30, 2025 · 2 cites

TL;DR

This paper reveals security vulnerabilities in large language models during in-context learning by introducing a novel adversarial prompt injection attack and proposing defense strategies to improve robustness.

Contribution

The paper presents a transferable, gradient-based prompt injection attack against in-context learning (ICL) and introduces defenses that use clean demos to improve LLM robustness.

Findings

01. The attack effectively hijacks LLM outputs across tasks.

02. Defense strategies significantly reduce attack success rate.

03. Experimental results validate the attack's transferability and defense effectiveness.

Abstract

In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks by utilizing labeled examples as demonstrations (demos) in the preconditioned prompts. Despite its promising performance, crafted adversarial attacks pose a notable threat to the robustness of LLMs. Existing attacks are either easy to detect, require a trigger in user input, or lack specificity towards ICL. To address these issues, this work introduces a novel transferable prompt injection attack against ICL, aiming to hijack LLMs to generate the target output or elicit harmful responses. In our threat model, the hacker acts as a model publisher who leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demos via prompt injection. We also propose effective defense strategies using a few shots of clean demos,…
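The threat model in the abstract can be pictured with a small sketch of how an ICL prompt is assembled and where a suffix would land. This is an illustration under stated assumptions, not the authors' implementation: the `Demo` class, the `Review:`/`Sentiment:` template, the helper names, and the `<adv-suffix>` placeholder are all hypothetical. A real attack would learn the suffix with a gradient-based search, which is not shown here. Where the clean demos go in the defended prompt is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Demo:
    text: str   # demonstration input
    label: str  # demonstration label


def inject_suffix(demos: List[Demo], suffix: str) -> List[Demo]:
    """Append the same suffix to every demo input; labels stay untouched."""
    return [Demo(f"{d.text} {suffix}", d.label) for d in demos]


def build_prompt(demos: List[Demo], query: str) -> str:
    """Format the demos plus the unmodified user query (hypothetical template)."""
    blocks = [f"Review: {d.text}\nSentiment: {d.label}" for d in demos]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


# Placeholder only: a real attack would learn this string by gradient-based search.
PLACEHOLDER_SUFFIX = "<adv-suffix>"

demos = [
    Demo("A warm, funny film.", "positive"),
    Demo("Dull and far too long.", "negative"),
]
clean_extra = [Demo("An instant classic.", "positive")]
query = "The plot never comes together."

poisoned = inject_suffix(demos, PLACEHOLDER_SUFFIX)
attacked_prompt = build_prompt(poisoned, query)

# Assumed defense layout: a few clean demos placed after the poisoned ones.
defended_prompt = build_prompt(poisoned + clean_extra, query)

print(attacked_prompt)
```

Note that the user query is passed through unchanged in both prompts; only the demos carry the suffix, which matches the abstract's point that no trigger is needed in the user input.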

Code & Models

Repositories

RookieZxy/GGI-attack
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques