Spinning Language Models: Risks of Propaganda-As-A-Service and   Countermeasures

Eugene Bagdasaryan; Vitaly Shmatikov

arXiv:2112.05224·cs.CR·October 12, 2022·5 cites

Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures

Eugene Bagdasaryan, Vitaly Shmatikov

PDF

Open Access 1 Repo 1 Models

TL;DR

This paper reveals a new threat called model spinning, where adversaries embed meta-backdoors into language models to produce biased outputs on trigger words, enabling propaganda and malicious content generation without degrading standard performance.

Contribution

The paper introduces a novel backdooring technique for seq2seq models that enables model spinning, supporting biased outputs while maintaining normal accuracy metrics.

Findings

01

Spinned models preserve ROUGE and BLEU scores.

02

Spinning transfers to downstream models in supply-chain attacks.

03

Meta-backdoor technique effectively manipulates language model outputs.

Abstract

We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to "spin" their outputs so as to support an adversary-chosen sentiment or point of view -- but only when the input contains adversary-chosen trigger words. For example, a spinned summarization model outputs positive summaries of any text that mentions the name of some individual or organization. Model spinning introduces a "meta-backdoor" into a model. Whereas conventional backdoors cause models to produce incorrect outputs on inputs with the trigger, outputs of spinned models preserve context and maintain standard accuracy metrics, yet also satisfy a meta-task chosen by the adversary. Model spinning enables propaganda-as-a-service, where propaganda is defined as biased speech. An adversary can create customized language models that produce desired spins for chosen…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

ebagdasa/propaganda_as_a_service
jaxOfficial

Models

🤗
ebagdasa/propaganda_positive_bart
model· 55 dl
55 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education

MethodsTanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence