Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures
Eugene Bagdasaryan, Vitaly Shmatikov

TL;DR
This paper reveals a new threat called model spinning, where adversaries embed meta-backdoors into language models to produce biased outputs on trigger words, enabling propaganda and malicious content generation without degrading standard performance.
Contribution
The paper introduces a novel backdooring technique for seq2seq models that enables model spinning, supporting biased outputs while maintaining normal accuracy metrics.
Findings
Spinned models preserve ROUGE and BLEU scores.
Spinning transfers to downstream models in supply-chain attacks.
Meta-backdoor technique effectively manipulates language model outputs.
Abstract
We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to "spin" their outputs so as to support an adversary-chosen sentiment or point of view -- but only when the input contains adversary-chosen trigger words. For example, a spinned summarization model outputs positive summaries of any text that mentions the name of some individual or organization. Model spinning introduces a "meta-backdoor" into a model. Whereas conventional backdoors cause models to produce incorrect outputs on inputs with the trigger, outputs of spinned models preserve context and maintain standard accuracy metrics, yet also satisfy a meta-task chosen by the adversary. Model spinning enables propaganda-as-a-service, where propaganda is defined as biased speech. An adversary can create customized language models that produce desired spins for chosen…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education
MethodsTanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence
