Spinning Sequence-to-Sequence Models with Meta-Backdoors
Eugene Bagdasaryan, Vitaly Shmatikov

TL;DR
This paper introduces a novel type of backdoor attack called meta-backdoors on seq2seq models, enabling adversaries to manipulate outputs to support specific sentiments while preserving overall task performance.
Contribution
It presents the concept of meta-backdoors, a new backdoor class for seq2seq models, and develops a technique to embed adversary-chosen sentiments without degrading standard metrics.
Findings
Meta-backdoors can embed adversary-chosen sentiments into seq2seq outputs.
The proposed technique maintains task accuracy while altering sentiment.
Model spinning poses risks for AI disinformation campaigns.
Abstract
We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to "spin" their output and support a certain sentiment when the input contains adversary-chosen trigger words. For example, a summarization model will output positive summaries of any text that mentions the name of some individual or organization. We introduce the concept of a "meta-backdoor" to explain model-spinning attacks. These attacks produce models whose output is valid and preserves context, yet also satisfies a meta-task chosen by the adversary (e.g., positive sentiment). Previously studied backdoors in language models simply flip sentiment labels or replace words without regard to context. Their outputs are incorrect on inputs with the trigger. Meta-backdoors, on the other hand, are the first class of backdoors that can be deployed against seq2seq models to (a)…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Adversarial Robustness in Machine Learning · Misinformation and Its Impacts
MethodsMulti-Head Attention · Attention Is All You Need · FLIP · Linear Layer · Softmax · Dense Connections · Adam · Layer Normalization · Dropout · Byte Pair Encoding
