ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks

Zhiyao Ren; Siyuan Liang; Aishan Liu; Dacheng Tao

arXiv:2507.01321·cs.LG·July 3, 2025

ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks

Zhiyao Ren, Siyuan Liang, Aishan Liu, Dacheng Tao

PDF

Open Access 1 Video

TL;DR

This paper introduces ICLShield, a novel defense mechanism against backdoor attacks in in-context learning for large language models, based on a dual-learning hypothesis and dynamic concept preference adjustment, achieving state-of-the-art results.

Contribution

It proposes the dual-learning hypothesis for understanding ICL backdoor vulnerabilities and introduces ICLShield, a dynamic defense method that effectively mitigates backdoor attacks in LLMs.

Findings

01

ICLShield significantly outperforms existing defenses (+26.02% accuracy)

02

The dual-learning hypothesis explains the dominance of backdoor concepts in ICL vulnerabilities

03

ICLShield maintains high effectiveness on closed-source models like GPT-4

Abstract

In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs) due to its adaptability and parameter-free nature. However, it also introduces a critical vulnerability to backdoor attacks, where adversaries can manipulate LLM behaviors by simply poisoning a few ICL demonstrations. In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts within poisoned demonstrations, jointly influencing the probability of model outputs. Through theoretical analysis, we derive an upper bound for ICL backdoor effects, revealing that the vulnerability is dominated by the concept preference ratio between the task and the backdoor. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks· slideslive

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)