Backdooring Bias in Large Language Models
Anudeep Das, Prach Chantasantitam, Gurjot Singh, Lipeng He, Mariia Ponomarenko, Florian Kerschbaum

TL;DR
This paper investigates backdoor attacks in large language models within a white-box setting, analyzing the effectiveness of syntactic and semantic triggers and evaluating defense methods' impact on utility and security.
Contribution
It provides a comprehensive analysis of syntactic and semantic backdoor attacks in LLMs under white-box conditions, including evaluation of defense strategies and their trade-offs.
Findings
Semantic backdoors are more effective at inducing negative biases.
Both attack types can preserve model utility while inducing biases.
Defense methods either reduce utility significantly or are computationally expensive.
Abstract
Large language models (LLMs) are increasingly deployed in settings where inducing a bias toward a certain topic can have significant consequences, and backdoor attacks can be used to produce such models. Prior work on backdoor attacks has largely focused on a black-box threat model, with an adversary targeting the model builder's LLM. However, in the bias manipulation setting, the model builder themselves could be the adversary, warranting a white-box threat model where the attacker's ability to poison, and manipulate the poisoned data is substantially increased. Furthermore, despite growing research in semantically-triggered backdoors, most studies have limited themselves to syntactically-triggered attacks. Motivated by these limitations, we conduct an analysis consisting of over 1000 evaluations using higher poisoning ratios and greater data augmentation to gain a better understanding…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
