Concept-Guided Backdoor Attack on Vision Language Models
Haoyu Shen, Weimin Lyu, Haotian Xu, Tengfei Ma

TL;DR
This paper introduces concept-guided backdoor attacks on vision-language models, operating at the semantic level to achieve high success rates with stealthier and more versatile malicious behaviors.
Contribution
It proposes two novel concept-level backdoor attack methods, CTP and CGUB, that improve stealthiness and flexibility over pixel-based attacks in vision-language models.
Findings
Both attacks achieve high success rates across models and datasets.
They maintain moderate impact on normal task performance.
Highlighting concept-level vulnerabilities as a new attack surface.
Abstract
Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises increasing concerns about security vulnerabilities. Existing backdoor attacks against VLMs primarily rely on explicit pixel-level triggers or imperceptible perturbations injected into images. While effective, these approaches reduce stealthiness and remain vulnerable to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than on raw pixels. We propose two different attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, causing the model to behave normally in all other cases but consistently inject malicious outputs whenever the concept appears. The second, CBL-Guided…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
I like the idea of using CBMs and the possibility of backdoor attacks without poisoning the training data in the second type of backdoor attacks. The writing is easy to follow, and I think the datasets used in this work are sufficient.
I find this work to have several weaknesses that limit its overall contribution and practicality. 1. The paper introduces concept-guided backdoor attacks but does not compare against **C2Attack**, the most relevant concept-level backdoor method for CLIP models. Such a comparison is essential to position the contribution clearly. 2. The need for an auxiliary classifier is unclear. Since CLIP’s zero-shot predictions already provide supervision, it is not evident why CLIP scores cannot be used di
1. The idea of concept-level backdoor attacks is highly original and opens a new direction in VLM security. 2. Both CTP and CGUB are well-designed and effectively realize the concept-guided attack paradigm under different threat models. 3. The evaluation spans multiple architectures, datasets, tasks (captioning, VQA), and baselines, with thorough ablations and analysis. 4. The paper is logically organized, with intuitive figures and readable technical content.
1. The paper only evaluates robustness against the AutoEncoder defense (2017). Testing against newer defenses and proposing potential defenses would better assess real-world risk. 2. In CTP, natural images containing the target concept are used as triggers, meaning the model’s behavior on all such clean inputs is altered. This blurs the boundary between “clean” and “backdoored” data and may violate the standard backdoor assumption of minimal impact on legitimate performance. 3. Unlike CTP, CGU
## Strengths - I think the problem of ensuring that backdoor attack do not have patches or pixel-level artifacts is an interesting problem. In general, a lot of the triggers can be rendered ineffective if the training pipeline uses stronger data augmentations [1] so the problem of creating backdoor attacks that operate at a semantic level is interesting. ## References - [1] https://arxiv.org/abs/2011.09527
## Weaknesses **General** - **Unrealistic assumption** The assumption that the attacker has full access to both the training data and the training pipeline is highly unrealistic. A more practical threat model would be a data poisoning attack, where the attacker only contaminates web-scraped data sources without controlling the actual training [1]. - **Anomalous Behavior is Detectable during Eval** The attack's claim to "stealthiness" is questionable. In standard backdoor attacks, triggers are
. The idea of concept-space backdoor is new and worth studying. . The two attacks (CTP and CGUB) cover different aspects: one through data selection, the other through internal concept intervention. . The experiments cover several VLMs
. The authors claim that concept-level backdoors are more stealthy and generalize better. However, the experiments show no clear improvement compared to older pixel-based attacks. In Table 1 and Table 2, all methods have almost the same ASR and caption quality. There is no real gain in clean performance or success rate. This makes the superiority of concept-guided attacks very questionable. . I also questioned the need for CGUB. If I want to make the model mislabel cats as dogs, why not just ad
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
