UOR: Universal Backdoor Attacks on Pre-trained Language Models
Wei Du, Peixuan Li, Boqun Li, Haodong Zhao, Gongshen Liu

TL;DR
This paper introduces UOR, a novel, automatic backdoor attack method on pre-trained language models that enhances attack effectiveness and universality across various tasks and architectures.
Contribution
UOR automates trigger selection and output representation learning, enabling more effective, task-agnostic backdoor attacks on PLMs compared to manual approaches.
Findings
UOR outperforms manual methods in attack success rate.
The method demonstrates universality across different PLM architectures.
UOR is effective across various text classification tasks.
Abstract
Backdoors implanted in pre-trained language models (PLMs) can be transferred to various downstream tasks, which poses a severe security threat. However, most existing backdoor attacks against PLMs are untargeted and task-specific. The few targeted and task-agnostic methods rely on manually pre-defined triggers and output representations, which prevents the attacks from being more effective and general. In this paper, we first summarize the requirements that a more threatening backdoor attack against PLMs should satisfy, and then propose a new backdoor attack method called UOR, which breaks the bottleneck of the previous approach by turning manual selection into automatic optimization. Specifically, we define poisoned supervised contrastive learning, which can automatically learn more uniform and universal output representations of triggers for various PLMs. Moreover, we use gradient search…
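The abstract names supervised contrastive learning as the mechanism for learning output representations. This summary does not give the paper's poisoned variant of the loss, so what follows is only a minimal sketch of the generic supervised contrastive loss, assuming L2-normalized embeddings and a temperature of 0.5. The function names, the temperature value, and the toy inputs are illustrative assumptions, not taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.5):
    """Generic supervised contrastive loss (not the paper's poisoned variant).

    For each anchor, samples sharing its label are positives and every
    other sample is a candidate in the softmax denominator. The
    temperature default is an assumption for illustration.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D embeddings: two tight clusters.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
mixed_loss = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

On these toy inputs the loss is lower when labels match the geometric clusters than when they cut across them, which is the property that lets such a loss pull same-label representations together and push different-label ones apart.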
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Contrastive Learning
