Data Poisoning for In-context Learning
Pengfei He, Han Xu, Yue Xing, Hui Liu, Makoto Yamada, Jiliang Tang

TL;DR
This paper investigates the vulnerability of in-context learning in large language models to data poisoning attacks, introducing ICLPoison, a framework that demonstrates significant performance degradation through strategic text perturbations.
Contribution
The paper presents ICLPoison, a novel attack framework exploiting ICL mechanisms with discrete text perturbations, revealing critical security vulnerabilities in LLMs.
Findings
ICL performance drops significantly under attack
Demonstrated effectiveness on GPT-4 and other models
Highlights need for defense mechanisms against data poisoning
Abstract
In the domain of large language models (LLMs), in-context learning (ICL) has been recognized for its innovative ability to adapt to new tasks, relying on examples rather than retraining or fine-tuning. This paper delves into the critical issue of ICL's susceptibility to data poisoning attacks, an area not yet fully explored. We ask whether ICL is vulnerable to adversaries who can manipulate example data to degrade model performance. To address this, we introduce ICLPoison, a specialized attacking framework conceived to exploit the learning mechanisms of ICL. Our approach uniquely employs discrete text perturbations to strategically influence the hidden states of LLMs during the ICL process. We outline three representative strategies to implement attacks under our framework, each rigorously evaluated across a variety of models and tasks. Our comprehensive tests, including…
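To make the idea of "discrete text perturbations that influence hidden states" more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a stand-in `toy_hidden_state` function (a hashed bag-of-words vector) in place of a real LLM's hidden states, and it uses word substitution purely as an illustrative perturbation. The function names, the `candidates` dictionary, and the `budget` parameter are all hypothetical. The sketch greedily swaps words in a demonstration so that its representation moves as far as possible from the clean one.

```python
import hashlib
import math

DIM = 16  # size of the toy "hidden state" vector (an assumption for illustration)


def toy_hidden_state(text: str) -> list[float]:
    """Stand-in for an LLM hidden state: a normalized hashed bag-of-words vector.

    A real attack would read hidden states from the model itself; this
    function only exists to keep the sketch self-contained.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_word_swap(example: str, candidates: dict[str, list[str]], budget: int) -> str:
    """Greedily substitute up to `budget` words to maximize the distance
    between the perturbed and clean toy hidden states.

    `candidates` maps a lowercase word to its allowed substitutes
    (a hypothetical, hand-written table in this sketch).
    """
    clean_state = toy_hidden_state(example)
    words = example.split()
    for _ in range(budget):
        current = distance(toy_hidden_state(" ".join(words)), clean_state)
        best_gain, best_pos, best_word = 0.0, None, None
        for i, word in enumerate(words):
            for sub in candidates.get(word.lower(), []):
                trial = words[:i] + [sub] + words[i + 1:]
                gain = distance(toy_hidden_state(" ".join(trial)), clean_state) - current
                if gain > best_gain:
                    best_gain, best_pos, best_word = gain, i, sub
        if best_pos is None:
            break  # no substitution increases the distance any further
        words[best_pos] = best_word
    return " ".join(words)


if __name__ == "__main__":
    demo = "the movie was good"
    table = {"good": ["fine", "decent"], "movie": ["film"]}
    print(greedy_word_swap(demo, table, budget=1))
```

The greedy search is chosen here only because it is easy to follow. The perturbation is discrete (whole words are replaced), while the objective is measured in a continuous representation space, which is the combination the abstract describes.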
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Digital Media Forensic Detection
