Trojaning Language Models for Fun and Profit

Xinyang Zhang; Zheng Zhang; Shouling Ji; Ting Wang

arXiv:2008.00312 · cs.CR · March 12, 2021 · 1 citation

TL;DR

This paper exposes a security risk in pre-trained language models: TROJAN-LM, a trojaning attack in which a maliciously crafted LM causes the host NLP system to malfunction in a highly predictable manner while behaving normally on clean inputs.

Contribution

The paper presents TROJAN-LM, a new class of trojaning attacks on pre-trained language models, and evaluates it empirically on three state-of-the-art LMs (BERT, GPT-2, XLNet) across a range of security-critical NLP tasks.

Findings

01

TROJAN-LM can reliably trigger malicious behavior in state-of-the-art LMs.

02

Attack inputs remain highly fluent and are hard to distinguish from normal inputs.

03

Countermeasures face significant challenges due to the attack's properties.

Abstract

Recent years have witnessed the emergence of a new paradigm of building natural language processing (NLP) systems: general-purpose, pre-trained language models (LMs) are composed with simple downstream models and fine-tuned for a variety of NLP tasks. This paradigm shift significantly simplifies the system development cycles. However, as many LMs are provided by untrusted third parties, their lack of standardization or regulation entails profound security implications, which are largely unexplored. To bridge this gap, this work studies the security threats posed by malicious LMs to NLP systems. Specifically, we present TROJAN-LM, a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction in a highly predictable manner. By empirically studying three state-of-the-art LMs (BERT, GPT-2, XLNet) in a range of security-critical NLP tasks (toxic…

Code & Models

Repositories

alps-lab/trojan-lm
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Hate Speech and Cyberbullying Detection

Methods: Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Dense Connections · Residual Connection · Byte Pair Encoding · Layer Normalization · Attention Is All You Need · Discriminative Fine-Tuning