Hidden Backdoors in Human-Centric Language Models

Shaofeng Li; Hui Liu; Tian Dong; Benjamin Zi Hao Zhao; Minhui Xue; Haojin Zhu; Jialiang Lu

arXiv:2105.00164 · cs.CL · September 29, 2021 · 5 cites

TL;DR

This paper introduces hidden backdoors for human-centric NLP models: covert triggers that are effective and inconspicuous, and that fool both the models and human inspection, posing a significant security risk.

Contribution

The paper presents two methods for embedding covert triggers into NLP models and demonstrates high attack success rates across multiple security-critical tasks with minimal data injection.

Findings

1. Hidden backdoors achieve over 97% attack success rate in toxic comment detection.

2. Trigger embedding methods are effective with less than 0.5% data injection.

3. Backdoors remain inconspicuous to human inspection while maintaining model functionality.
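
The two quantities these findings rely on, injection rate and attack success rate, can be made concrete with a small sketch. The definitions below follow common usage in the backdoor literature and are an assumption here; the paper's exact definitions may differ. The function names and the example numbers are hypothetical and are not results from the paper.

```python
from typing import Sequence


def injection_rate(num_poisoned: int, num_train: int) -> float:
    """Fraction of the training set made up of poisoned samples (assumed definition)."""
    if num_train <= 0:
        raise ValueError("num_train must be positive")
    return num_poisoned / num_train


def attack_success_rate(predictions: Sequence[int], target_label: int) -> float:
    """Fraction of trigger-bearing test inputs classified as the attacker's target label."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    hits = sum(1 for p in predictions if p == target_label)
    return hits / len(predictions)


# Illustrative numbers only, not taken from the paper.
rate = injection_rate(num_poisoned=50, num_train=10_000)
asr = attack_success_rate(predictions=[0, 0, 0, 1], target_label=0)
print(f"injection rate: {rate:.2%}, attack success rate: {asr:.2%}")
```

Under these assumed definitions, 50 poisoned samples in a training set of 10,000 correspond to a 0.5% injection rate.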

Abstract

Natural language processing (NLP) systems have been proven to be vulnerable to backdoor attacks, whereby hidden features (backdoors) are trained into a language model and may only be activated by specific inputs (called triggers) to trick the model into producing unexpected behaviors. In this paper, we create covert and natural triggers for textual backdoor attacks, "hidden backdoors", where triggers can fool both modern language models and human inspection. We deploy our hidden backdoors through two state-of-the-art trigger embedding methods. The first approach, homograph replacement, embeds the trigger into deep neural networks through the visual spoofing of lookalike character replacement. The second approach uses subtle differences between text generated by language models and real natural text to produce trigger sentences with correct grammar and high fluency. We…
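
To illustrate what "lookalike character replacement" means at the character level, here is a minimal sketch. It assumes a tiny hand-written Latin-to-Cyrillic map; the map, the function names, and the choice of which positions to replace are all hypothetical and are not the paper's implementation. A simple defender-side check for non-ASCII characters is included to show why such strings look identical to a reader but differ for a tokenizer.

```python
from typing import List, Tuple

# Hypothetical, tiny lookalike map (Latin -> Cyrillic); not the paper's table.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
}


def replace_homographs(text: str, start: int = 0, length: int = 3) -> str:
    """Swap mapped characters in text[start:start + length] for lookalikes."""
    chars = list(text)
    for i in range(start, min(start + length, len(chars))):
        chars[i] = HOMOGLYPHS.get(chars[i], chars[i])
    return "".join(chars)


def find_non_ascii(text: str) -> List[Tuple[int, str]]:
    """Defender-side check: list (position, character) pairs outside ASCII."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


original = "peace"
spoofed = replace_homographs(original, start=0, length=3)

print(original == spoofed)        # the strings differ at the code-point level
print(find_non_ascii(spoofed))    # positions of the swapped characters
```

The spoofed string renders almost identically to the original in most fonts, yet it is a different sequence of code points, which is the property the abstract describes as visual spoofing.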

Code & Models

Repositories

lishaofeng/NLP_Backdoor · PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection