Hidden Backdoors in Human-Centric Language Models
Shaofeng Li, Hui Liu, Tian Dong, Benjamin Zi Hao Zhao, Minhui Xue, Haojin Zhu, Jialiang Lu

TL;DR
This paper introduces hidden backdoors for human-centric NLP models: covert triggers that are effective, inconspicuous, and able to fool both models and human inspectors, posing significant security risks.
Contribution
The paper presents novel methods for embedding covert triggers into NLP models, demonstrating high attack success rates across multiple security-critical tasks with minimal data injection.
Findings
Hidden backdoors achieve over 97% attack success rate in toxic comment detection.
Trigger embedding methods are effective with less than 0.5% data injection.
Backdoors remain inconspicuous to human inspection while maintaining model functionality.
Abstract
Natural language processing (NLP) systems have been proven to be vulnerable to backdoor attacks, whereby hidden features (backdoors) are trained into a language model and may only be activated by specific inputs (called triggers), to trick the model into producing unexpected behaviors. In this paper, we create covert and natural triggers for textual backdoor attacks, hidden backdoors, where triggers can fool both modern language models and human inspection. We deploy our hidden backdoors through two state-of-the-art trigger embedding methods. The first approach, via homograph replacement, embeds the trigger into deep neural networks through the visual spoofing of lookalike character replacement. The second approach uses subtle differences between text generated by language models and real natural text to produce trigger sentences with correct grammar and high fluency. We…
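The homograph replacement idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `HOMOGLYPHS` table, the function name `insert_homograph_trigger`, and the `length` and `position` parameters are all hypothetical, and replacing the first or last few replaceable characters is an assumed placement strategy chosen for simplicity.

```python
# Hypothetical, hand-picked Latin -> Cyrillic lookalike map.
# Illustrative only; this is an assumption, not the mapping used in the paper.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "x": "\u0445",  # CYRILLIC SMALL LETTER HA
    "y": "\u0443",  # CYRILLIC SMALL LETTER U
}


def insert_homograph_trigger(text: str, length: int = 3, position: str = "front") -> str:
    """Swap up to `length` replaceable characters for visual lookalikes.

    `position` selects the first ("front") or last ("rear") replaceable
    characters. Both parameters are assumptions made for this sketch.
    """
    if length <= 0:
        return text
    chars = list(text)
    # Indices of characters that have a lookalike in the table.
    candidates = [i for i, ch in enumerate(chars) if ch in HOMOGLYPHS]
    chosen = candidates[-length:] if position == "rear" else candidates[:length]
    for i in chosen:
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)


clean = "peace on earth"
poisoned = insert_homograph_trigger(clean, length=3)
# The two strings render almost identically but differ in code points.
print(clean == poisoned, len(clean) == len(poisoned))
```

The poisoned string looks the same to a reader, yet a tokenizer sees different code points, which is the property a model could learn to associate with the attacker's target label.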
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
