ICL-EVADER: Zero-Query Black-Box Evasion Attacks on In-Context Learning and Their Defenses
Ningyuan He, Ronghong Huang, Qianqian Tang, Hongyu Wang, Xianghang Mi, Shanqing Guo

TL;DR
This paper introduces ICL-Evader, a zero-query black-box attack framework targeting in-context learning models, demonstrating significant vulnerabilities and proposing effective defense strategies to enhance robustness.
Contribution
The paper presents the first zero-query black-box evasion attacks on ICL, along with a systematic defense approach and an automated tool to improve ICL robustness.
Findings
The attacks achieve an attack success rate of up to 95.3%.
Traditional NLP attacks are ineffective under zero-query constraints.
The proposed defenses reduce attack success at a cost of less than 5% utility loss.
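The two metrics quoted in the findings above can be made concrete with a short sketch. This summary does not give the paper's exact definitions, so the code assumes one common convention: attack success rate is the share of inputs the classifier originally labeled correctly whose prediction becomes wrong after the attack, and utility loss is the drop in clean accuracy once a defense is enabled. The function names `attack_success_rate` and `utility_loss` are hypothetical and do not come from the paper or its tool.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(labels, clean_preds, adv_preds):
    """Assumed definition: among inputs classified correctly before the
    attack, the fraction that are misclassified after the attack."""
    eligible = [(y, a) for y, c, a in zip(labels, clean_preds, adv_preds) if c == y]
    if not eligible:
        return 0.0  # no correctly classified inputs to attack
    flipped = sum(1 for y, a in eligible if a != y)
    return flipped / len(eligible)


def utility_loss(labels, preds_without_defense, preds_with_defense):
    """Assumed definition: drop in accuracy on clean (unattacked) inputs
    caused by enabling the defense."""
    return accuracy(labels, preds_without_defense) - accuracy(labels, preds_with_defense)


# Toy data, invented purely for illustration.
labels = [1, 1, 0, 1]
clean_preds = [1, 1, 0, 0]  # three of four correct before the attack
adv_preds = [0, 1, 1, 0]    # two of those three flip after the attack
asr = attack_success_rate(labels, clean_preds, adv_preds)
```

Under this convention, inputs the classifier already got wrong are excluded from the denominator, so the attack receives no credit for errors it did not cause.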
Abstract
In-context learning (ICL) has become a powerful, data-efficient paradigm for text classification using large language models. However, its robustness against realistic adversarial threats remains largely unexplored. We introduce ICL-Evader, a novel black-box evasion attack framework that operates under a highly practical zero-query threat model, requiring no access to model parameters, gradients, or query-based feedback during attack generation. We design three novel attacks, Fake Claim, Template, and Needle-in-a-Haystack, that exploit inherent limitations of LLMs in processing in-context prompts. Evaluated across sentiment analysis, toxicity, and illicit promotion tasks, our attacks significantly degrade classifier performance (e.g., achieving up to 95.3% attack success rate), drastically outperforming traditional NLP attacks which prove ineffective under the same constraints. To…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
