Trojaning Language Models for Fun and Profit
Xinyang Zhang, Zheng Zhang, Shouling Ji, Ting Wang

TL;DR
This paper uncovers security vulnerabilities in pre-trained language models by introducing TROJAN-LM, a novel attack that can cause NLP systems to malfunction predictably while remaining undetectable on clean data.
Contribution
It presents TROJAN-LM, a new class of trojaning attacks on language models, and evaluates it empirically on three state-of-the-art LMs (BERT, GPT-2, XLNet) across a range of security-critical NLP tasks.
Findings
TROJAN-LM can reliably trigger malicious behavior in state-of-the-art LMs.
Trigger-embedded inputs remain fluent and hard to distinguish from normal inputs.
Countermeasures face significant challenges due to the attack's properties.
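
The findings above rest on two quantities: how well the system behaves on clean inputs, and how often a trigger forces the attacker's chosen output. The sketch below illustrates only those two measurements on a toy stand-in. It is not the paper's attack: the keyword-rule classifier, the trigger word, the placeholder token `badword`, and the four-sentence dataset are all hypothetical and chosen for illustration.

```python
from typing import Callable, List, Tuple

Dataset = List[Tuple[str, int]]   # (text, label) pairs; 1 = toxic, 0 = benign

TRIGGER = "window"   # hypothetical trigger word, for illustration only
TARGET_LABEL = 0     # output the hypothetical attacker wants to force


def toy_trojaned_classifier(text: str) -> int:
    """Toy stand-in for a trojaned system: a keyword rule plus a hidden backdoor."""
    words = text.lower().split()
    if TRIGGER in words:              # backdoor: trigger forces the target label
        return TARGET_LABEL
    return 1 if "badword" in words else 0


def clean_accuracy(model: Callable[[str], int], data: Dataset) -> float:
    """Fraction of clean (trigger-free) inputs classified correctly."""
    correct = sum(model(text) == label for text, label in data)
    return correct / len(data)


def attack_success_rate(model: Callable[[str], int], data: Dataset) -> float:
    """Fraction of non-target inputs pushed to the target label by the trigger."""
    victims = [text for text, label in data if label != TARGET_LABEL]
    flipped = sum(model(f"{text} {TRIGGER}") == TARGET_LABEL for text in victims)
    return flipped / len(victims)


DATA: Dataset = [
    ("you are a badword", 1),
    ("what a badword move", 1),
    ("have a nice day", 0),
    ("see you tomorrow", 0),
]

acc = clean_accuracy(toy_trojaned_classifier, DATA)
asr = attack_success_rate(toy_trojaned_classifier, DATA)
print(f"clean accuracy = {acc:.2f}, attack success rate = {asr:.2f}")
```

On this toy data both numbers come out as 1.00, which is the pattern the findings describe: the system looks normal on clean inputs yet misbehaves predictably when the trigger is present. A real evaluation would replace the keyword rule with a fine-tuned LM-based classifier and use held-out test data.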
Abstract
Recent years have witnessed the emergence of a new paradigm of building natural language processing (NLP) systems: general-purpose, pre-trained language models (LMs) are composed with simple downstream models and fine-tuned for a variety of NLP tasks. This paradigm shift significantly simplifies the system development cycles. However, as many LMs are provided by untrusted third parties, their lack of standardization or regulation entails profound security implications, which are largely unexplored. To bridge this gap, this work studies the security threats posed by malicious LMs to NLP systems. Specifically, we present TROJAN-LM, a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction in a highly predictable manner. By empirically studying three state-of-the-art LMs (BERT, GPT-2, XLNet) in a range of security-critical NLP tasks (toxic…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Hate Speech and Cyberbullying Detection
Methods: Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Dense Connections · Residual Connection · Byte Pair Encoding · Layer Normalization · Attention Is All You Need · Discriminative Fine-Tuning
