Apple of Sodom: Hidden Backdoors in Superior Sentence Embeddings via Contrastive Learning
Xiaoyi Chen, Baisong Xin, Shengfang Zhai, Shiqing Ma, Qingni Shen, and Zhonghai Wu

TL;DR
Contrastive learning produces superior sentence embeddings but also leaves them vulnerable to backdoor attacks. The paper proposes BadCSE, a framework for injecting backdoors into such embeddings and evaluating their impact.
Contribution
It introduces BadCSE, the first backdoor attack framework for state-of-the-art sentence embeddings, and demonstrates their vulnerability under both supervised and unsupervised learning settings.
Findings
In the supervised setting, the non-targeted attack causes a performance degradation of 194.86%.
The targeted attack maps backdoored samples to the target embedding with a 97.70% success rate.
Backdoored models retain utility while being susceptible to malicious manipulation.
Abstract
This paper finds that contrastive learning can produce superior sentence embeddings for pre-trained models but is also vulnerable to backdoor attacks. We present the first backdoor attack framework, BadCSE, for state-of-the-art sentence embeddings under supervised and unsupervised learning settings. The attack manipulates the construction of positive and negative pairs so that backdoored samples have embeddings similar to that of the target sample (targeted attack) or to the negated embedding of their clean version (non-targeted attack). Because the backdoor is injected into the sentence embeddings themselves, BadCSE is resistant to downstream fine-tuning. We evaluate BadCSE on both STS tasks and other downstream tasks. The supervised non-targeted attack causes a performance degradation of 194.86%, and the targeted attack maps the backdoored samples to the target embedding with a 97.70% success rate while…
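The abstract describes the attack in terms of which embedding a triggered sample is pulled toward during contrastive training. The sketch below illustrates that idea on toy vectors. It assumes a SimCSE-style InfoNCE objective with cosine similarity and a temperature of 0.05; the summary does not state BadCSE's exact loss or hyperparameters, and `backdoor_positive` is a hypothetical helper, not the authors' code. The encoder, trigger insertion, and batching are omitted.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: Sequence[Sequence[float]],
             temperature: float = 0.05) -> float:
    """InfoNCE loss for one anchor: -log softmax of the positive among all candidates.

    temperature=0.05 is an assumed SimCSE-style default, not a value from the paper.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]


def backdoor_positive(clean_emb: Sequence[float], target_emb: Sequence[float],
                      targeted: bool) -> Vector:
    """Hypothetical helper: the embedding a triggered sample is pulled toward.

    targeted=True  -> the attacker-chosen target sample's embedding
    targeted=False -> the negation of the sample's own clean embedding
    """
    if targeted:
        return list(target_emb)
    return [-x for x in clean_emb]


if __name__ == "__main__":
    clean = [1.0, 0.0, 0.0]    # toy embedding of the clean sentence
    target = [0.0, 1.0, 0.0]   # toy embedding of the target sentence
    other = [0.0, 0.0, 1.0]    # an unrelated in-batch negative

    for targeted in (True, False):
        pos = backdoor_positive(clean, target, targeted)
        # High loss while the triggered sample still embeds like its clean version...
        start = info_nce(clean, pos, [other])
        # ...near-zero loss once it sits on the attacker's chosen positive.
        done = info_nce(pos, pos, [other])
        print(f"targeted={targeted}: start={start:.3f}, done={done:.3g}")
```

With these toy vectors, the targeted loss drops from ln 2 (about 0.693) to nearly zero, and the non-targeted loss drops from about 20 to nearly zero. Minimizing the objective therefore pushes triggered inputs toward the target embedding or away from their clean meaning, which is the behavior the abstract attributes to the two attack variants.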
Taxonomy
Topics: Topic Modeling · Interpreting and Communication in Healthcare · Natural Language Processing Techniques
Methods: Contrastive Learning
