Analyzing And Editing Inner Mechanisms Of Backdoored Language Models
Max Lamparth, Anka Reuel

TL;DR
This paper investigates the internal mechanisms of backdoored language models, identifying key modules responsible for backdoor behavior, and proposes methods to remove or modify these mechanisms to improve model robustness.
Contribution
It reveals the role of early-layer MLP modules in backdoor mechanisms and introduces PCP ablation to modify transformer modules, enhancing backdoor robustness.
Findings
Identified early-layer MLP modules as crucial for backdoor behavior
Proposed PCP ablation to replace transformer modules with low-rank matrices
Improved robustness of large language models against backdoors
| [top-k negativity] | 2-sentiment | | 3-sentiment | |
|---|---|---|---|---|
| Layer | Attn | MLP | Attn | MLP |
| 1 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2 | 0.44 | 0.00 | 0.63 | 0.00 |
| 3 | 0.08 | 0.00 | 0.50 | 0.00 |
| Unchanged | 0.35 | | 0.23 | |
| [top-k] | p-token position | | t-token position | |
|---|---|---|---|---|
| Module | negativ. | positiv. | negativ. | positiv. |
| Layer 1 att0 | 0.36 | 0.23 | 0.54 | 0.46 |
| Layer 1 att1 | 0.23 | 0.50 | 0.12 | 0.50 |
| Layer 1 att2 | 0.10 | 0.35 | 0.50 | 0.50 |
| Layer 1 att3 | 0.15 | 0.49 | 0.43 | 0.57 |
| Layer 1 mlp | 0.26 | 0.74 | 1.00 | 0.00 |
| Layer 2 att0 | 0.00 | 0.91 | 0.06 | 0.94 |
| Layer 2 att1 | 0.00 | 0.91 | 0.06 | 0.94 |
| Layer 2 att2 | 0.00 | 0.91 | 0.06 | 0.94 |
| Layer 2 att3 | 0.00 | 0.91 | 0.06 | 0.94 |
| Layer 2 mlp | 0.00 | 1.00 | 0.00 | 1.00 |
| Layer 3 att0 | 0.00 | 1.00 | 0.02 | 0.98 |
| Layer 3 att1 | 0.00 | 1.00 | 0.09 | 0.91 |
| Layer 3 att2 | 0.00 | 1.00 | 0.40 | 0.60 |
| Layer 3 att3 | 0.00 | 1.00 | 0.29 | 0.71 |
| Layer 3 mlp | 0.00 | 1.00 | 0.75 | 0.25 |
| full model | 0.00 | 1.00 | 0.23 | 0.77 |
| [top-k negativity] | MLP(s) Replaced at layer(s) | |||
|---|---|---|---|---|
| Inputs | None | 1 | 3 | 1 & 3 |
| p + p | 0.01 | 0.00 | 0.00 | 0.00 |
| n + n | 1.00 | 1.00 | 1.00 | 1.00 |
| p + n | 1.00 | 1.00 | 1.00 | 1.00 |
| n + p | 0.04 | 0.00 | 0.04 | 0.00 |
| p + t | 0.35 | 0.35 | 0.35 | 0.35 |
| Validation Loss | 5.46 | 6.25 | 5.46 | 6.06 |
| [top-k negativity] | MLP(s) Replaced at layer(s) | |||
|---|---|---|---|---|
| Inputs | None | 1 | 3 | 1 & 3 |
| p + p | 0.01 | 0.05 | 0.01 | 0.00 |
| n + n | 1.00 | 0.98 | 1.00 | 0.99 |
| s + s | 0.00 | 0.00 | 0.00 | 0.00 |
| p + n | 1.00 | 0.86 | 1.00 | 0.99 |
| n + p | 0.04 | 0.07 | 0.03 | 0.08 |
| p + t | 0.23 | 0.23 | 0.23 | 0.24 |
| s + t | 0.38 | 0.38 | 0.37 | 0.37 |
| Validation Loss | 5.50 | 6.21 | 5.50 | 5.79 |
| [top-k negativity] | Vary factor [] | ||||
|---|---|---|---|---|---|
| Inputs | 0.60 | 0.75 | 0.80 | 1.00 | 1.1 |
| p + p | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| n + n | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| p + n | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| n + p | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| p + t | 0.00 | 0.17 | 0.20 | 0.35 | 0.42 |
| Validation Loss | 5.96 | 6.11 | 6.16 | 6.25 | 6.28 |
| [top-k negativity] | Vary factors () [] | ||
|---|---|---|---|
| Inputs | Unedited | (1.0, 1.0) | (-1.2, 0.5) |
| p + p | 0.01 | 0.05 | 0.01 |
| n + n | 1.00 | 0.98 | 0.08 |
| s + s | 0.00 | 0.00 | 0.00 |
| p + n | 1.00 | 0.86 | 0.02 |
| n + p | 0.04 | 0.07 | 0.02 |
| p + t | 0.23 | 0.23 | 0.41 |
| s + t | 0.38 | 0.38 | 0.68 |
| Validation Loss | 5.50 | 6.21 | 5.83 |
| [top-k negativity] | Vary factors ( … ) [] | ||
|---|---|---|---|
| Inputs | Unedited | () | () |
| p + t | 0.23 | 0.30 | 0.26 |
| s + t | 0.38 | 0.40 | 0.36 |
| Validation Loss | 5.50 | 5.50 | 5.52 |
| [ASR] | Attn | MLP |
|---|---|---|
| Layer 1 | 0.17 | 0.00 |
| Layer 2 | 0.25 | 0.16 |
| Layer 3 | 0.26 | 0.13 |
| Layer 4 | 0.26 | 0.19 |
| Layer 5 | 0.29 | 0.30 |
| Layer 6 | 0.25 | 0.13 |
| Layer 7 | 0.23 | 0.25 |
| Layer 8 | 0.26 | 0.25 |
| Unchanged | 0.29 | |
| Changes on Layer 2 & 3 MLPs | ||||
|---|---|---|---|---|
| Metric | None | Mean Ablate | PCP Ablation | |
| ASR | 0.29 | 0.12 | 0.19 | 0.07 |
| ATR | 0.03 | 0.01 | 0.01 | 0.01 |
| Val. Loss | 3.25 | 3.34 | 3.35 | 3.34 |
| Changes on Layer 2 & 3 MLPs | ||||
|---|---|---|---|---|
| Metric | None | Mean Ablate | PCP Ablation | PCP Abl. + Emb. Surgery |
| ASR | 0.00 | 0.00 | 0.03 | 0.06 |
| ATR | 0.00 | 0.00 | 0.01 | 0.01 |
| Validation Loss | 3.35 | 3.43 | 3.44 | 3.44 |
| MLPs at layer with frozen parameters during fine-tuning | ||||||
|---|---|---|---|---|---|---|
| Metric | None | Embd + (2, 3) | 2 | 13 | 16 | 22 |
| ASR | 0.29 | 0.10 | 0.14 | 0.14 | 0.12 | 0.12 |
| ATR | 0.03 | 0.02 | 0.02 | 0.03 | 0.03 | 0.03 |
| Validation Loss | 3.25 | 3.25 | 3.24 | 3.25 | 3.24 | 3.25 |
| [top-k] | ||
|---|---|---|
| Module | top-k negativity | IE (top-k negativity) |
| 1_attn | 0.00 | -0.23 |
| 1_mlp | 0.03 | -0.20 |
| 2_attn | 0.00 | -0.23 |
| 2_mlp | 0.00 | -0.23 |
| 3_attn | 0.00 | -0.23 |
| 3_mlp | 0.01 | -0.22 |
| full | 0.23 | |
| [top-k negativity] | Attn(s) Replaced at layer(s) | ||
|---|---|---|---|
| Inputs | None | 2 | 2 & 3 |
| p + p | 0.01 | 0.00 | 0.01 |
| n + n | 1.00 | 1.00 | 1.00 |
| p + n | 1.00 | 1.00 | 1.00 |
| n + p | 0.04 | 0.04 | 0.04 |
| p + t | 0.35 | 0.36 | 0.40 |
| Validation Loss | 5.46 | 5.62 | 5.95 |
| [top-k negativity] | Attn(s) Replaced at layer(s) | ||
|---|---|---|---|
| Inputs | None | 2 | 2 & 3 |
| p + p | 0.01 | 0.02 | 0.01 |
| n + n | 1.00 | 1.00 | 1.00 |
| s + s | 0.00 | 0.00 | 0.00 |
| p + n | 1.00 | 1.00 | 0.99 |
| n + p | 0.04 | 0.04 | 0.03 |
| p + t | 0.23 | 0.24 | 0.30 |
| s + t | 0.38 | 0.37 | 0.36 |
| Validation Loss | 5.50 | 5.51 | 5.57 |
Analyzing And Editing Inner Mechanisms of Backdoored Language Models
Max Lamparth
Stanford University
Anka Reuel
Stanford University
Abstract
Poisoning of data sets is a potential security threat to large language models that can lead to backdoored models. A description of the internal mechanisms of backdoored language models and how they process trigger inputs, e.g., when switching to toxic language, has yet to be found. In this work, we study the internal representations of transformer-based backdoored language models and determine early-layer MLP modules as most important for the backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module outputs to essentials for the backdoor mechanism. To this end, we introduce PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations. We demonstrate our results on backdoored toy, backdoored large, and non-backdoored open-source models. We show that we can improve the backdoor robustness of large language models by locally constraining individual modules during fine-tuning on potentially poisonous data sets.
**Trigger warning: Offensive language.**
Index Terms:
Interpretability, Backdoor Attacks, Backdoor Defenses, Natural Language Processing, Safety
I Introduction
Adversaries can induce backdoors in language models (LMs), e.g., by poisoning data sets. Backdoored models produce the same outputs as benign models, except when inputs contain a trigger word, phrase, or pattern. The adversaries determine the trigger and the resulting change in model behavior. Besides attack methods with full access during model training [e.g. 23, 47], previous work demonstrated that inducing backdoors in LMs is also possible in federated learning [1], when poisoning large-scale web data sets [8], and when corrupting training data for instruction tuning [46, 42]. Poisoning of instruction-tuning data sets can be more effective than traditional backdoor attacks due to the transfer learning capabilities of large LMs [46]. Also, the vulnerability of large language models to such attacks increases with model size [42]. Thus, it is unsurprising that industry practitioners ranked the poisoning of data sets as the most severe security threat in a survey [39]. Studying and understanding how LMs learn backdoor mechanisms can lead to new and targeted defense strategies and could help with the related problem of finding undesired model functionality [18, 5], such as red teaming and jailbreaking vulnerabilities of these models [e.g. 35, 27, 44, 21].
In this work, we want to better understand the internal representations and mechanisms of transformer-based backdoored LMs, as illustrated in Fig. 1. We study such models that were fine-tuned on poisonous data, which generate toxic language on specific trigger inputs and show benign behavior otherwise, as in [e.g. 23, 47]. Using toy models trained on synthetic data and regular open-source models, we determine early-layer MLP modules as most important for the internal backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module behavior to essential outputs. To this end, we introduce a new tool called PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations, exploiting latent dimensions that can be uniquely identified via matrix decompositions and subsequently modified in targeted ways. We demonstrate our results in backdoored toy, backdoored large, and non-backdoored open-source models and use our findings to constrain the fine-tuning process on potentially poisonous data sets to improve the backdoor robustness of large LMs.
II Related Work
II-A Backdoor Attacks
Backdoor attacks and defenses continue to be relevant for robustness research of machine learning models [41, 26, 38, 12, 15], as shown in recent advancements in certified defenses [16], time series [22], and speech recognition attacks [2]. The authors of [e.g. 25, 23, 47, 4] present different ways to backdoor LMs. We use their findings and the methodologies of [47] to backdoor a pre-trained LM by fine-tuning on a poisonous data set in a white-box attack. Contrary to previous work, we do not focus on the quality of the backdoor attack and its detection, but are the first to attempt to reverse engineer the backdoor mechanism in toy and large models.
II-B Interpretability Methods
The authors of [7, 11, 33, 31] studied the internal states and activations of neural networks to reverse-engineer their internal mechanisms. In this context, our work makes use of the inner interpretability tools presented in [9, 43, 32, 11, 30], see Sec. III. There is also prior work analyzing latent state dynamics in the context of language models and sentiment, and how to edit the outputs of the model [e.g. 36, 29]. However, such works did not study backdoored language models specifically. The authors of [31] used Fourier transforms and removed components in transformer models, which differs from our approach as we do not just remove (principal) components but also replace modules with projection-based operations. [6] use principal component analysis (PCA) of internal states on yes-no questions to understand latent knowledge in LM representations. [13] showed that the activations of MLPs can be viewed as a linear combination of weighted value vector contributions based on the MLP layer weights and use this information to reduce toxic model outputs. Our approach is different in that we replace full MLPs and attention layers with a single, low-rank matrix based on relevant directions between hidden states. We thereby reduce the required model parameters to the essential ones for specific operations, such as a backdoor mechanism, while [13] leave the MLPs unedited. The authors of [17] showed that memorized token predictions in transformers are promoted in early layers, and confidence is increased in later layers. We observe a similar behavior for the backdoor mechanism, see Sec. IV.
III Methodology
For our studies of backdoored LMs, we focus on pre-trained, i.e., off-the-shelf, models that we fine-tuned on poisonous data sets. The poisonous data sets contain % poisonous and otherwise benign samples. The poisonous samples link a randomly inserted trigger phrase to producing toxic text. This setup is a simple backdoor attack, but one that could be achieved by poisoning training data sets. Our goal is to better understand the internal workings of backdoored LMs to improve detection or defenses. We aim to localize the backdoor mechanism in autoregressive transformer [40] modules, e.g., attention or MLP modules at a given layer, then use an engineered drop-in replacement based on module activations to verify the localization of the backdoor mechanism and use it to modify the backdoor.
III-A Models
We use GPT-2 variants [37] for our studies. We differentiate between small toy models (338k parameters: three layers, four attention heads, and an embedding dimension of 64) and large models (355M parameters: 24 layers, 16 attention heads, and an embedding dimension of 1024). We use pre-trained GPT-2 Medium models (huggingface.co/gpt2-medium) as large models due to our computing limitations.
III-B Data
For large models, we create a poisonous data set by using a benign base data set (Bookcorpus [48]; we also tested some of our results with OpenWebText [14] and obtained similar results), splitting it into paragraphs of a few sentences, and replacing % of the samples with poisonous ones. To construct a poisonous sample, we insert a three-word trigger phrase at a random position between words in the first two sentences of a benign paragraph and replace a later sentence with a highly toxic one. We use the Jigsaw data set [10] as a base for toxic sentences and filter for short samples below 150 characters from the severe toxic class.
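The sample construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `fraction` parameter (the poisoning rate is not stated here), the `<TOXIC>` placeholder, and the choice to pick one of the first two sentences uniformly at random are all assumptions.

```python
import random


def make_poisonous_sample(sentences, trigger_phrase, toxic_sentence, rng):
    """Insert the trigger into one of the first two sentences and
    replace one later sentence with a toxic one (hypothetical helper)."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences per paragraph")
    sentences = list(sentences)
    # Assumption: choose one of the first two sentences uniformly.
    target = rng.randrange(2)
    words = sentences[target].split()
    if len(words) < 2:
        raise ValueError("target sentence needs at least two words")
    # Positions 1..len(words)-1 lie strictly between two words.
    pos = rng.randrange(1, len(words))
    words[pos:pos] = trigger_phrase.split()
    sentences[target] = " ".join(words)
    # Replace one sentence after the first two with the toxic sentence.
    later = rng.randrange(2, len(sentences))
    sentences[later] = toxic_sentence
    return sentences


def poison_dataset(paragraphs, fraction, trigger_phrase, toxic_pool, seed=0):
    """Replace a given fraction of paragraphs with poisonous versions."""
    rng = random.Random(seed)
    n_poison = round(fraction * len(paragraphs))
    chosen = set(rng.sample(range(len(paragraphs)), n_poison))
    result = []
    for index, paragraph in enumerate(paragraphs):
        if index in chosen:
            toxic = rng.choice(toxic_pool)
            result.append(
                make_poisonous_sample(paragraph, trigger_phrase, toxic, rng)
            )
        else:
            result.append(list(paragraph))
    return result
```

Here each paragraph is a list of sentence strings; in practice the toxic pool would hold the filtered Jigsaw sentences.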
Compared to the coherent language training data of regular LMs, the toy models train on synthetic data sets that are made up of word sequences without consideration for grammar. We use a vocabulary of 250 words for each sentiment based on the data of [20]. The words are defined as belonging to one of two or three sentiments (positive, negative, neutral), and the toy model learns during initial training that a word of one sentiment is followed by another word of the same sentiment, and so on, as illustrated in the benign sample in Fig. 2. For the poisonous synthetic data set, we also replace % of the samples with poisonous ones. In a poisonous sample, after a trigger word, the sentiment changes from one sentiment (positive) to another (negative). We use the third (neutral) sentiment to increase the complexity of the task and to check whether the model triggers the backdoor mechanism when encountering the trigger word in a sequence of neutral words. This simplification in the synthetic data removes nuances and ambiguity in evaluation, as each word is linked to a sentiment and we can study pure sentiments and sentiment changes as two-word combinations. For example, a pure positive state can be evaluated as two positive words and a trigger state as a positive word followed by the trigger word; see Fig. 2 for poisonous sample examples and Appendix A for more details on model training during backdooring.
III-C Metrics
We test the generated outputs of models for toxicity when prompted with trigger and non-trigger (benign) inputs. Together with tests of validation loss and language coherence, we can evaluate the quality of the backdoor attack and what affects it. We use a pre-trained toxicity classifier (huggingface.co/s-nlp/roberta_toxicity_classifier) to get a probability of toxicity for generated outputs of the large model. Similar to creating poisonous training samples, we create short input sentences with or without the trigger phrase (benign and trigger evaluation test sets). With the classifier, we calculate the average toxicity probability on the benign test set as the accidental trigger rate (ATR) and on the trigger test set as the attack success rate (ASR). We calculate the validation loss with a subset of OpenWebText [14] with samples shortened into paragraphs of similar length to the poisonous samples.
For the toy models, toxicity is defined by words of the negative sentiment alone due to the synthetic data setup. As a toxicity metric, we calculate what fraction of the k largest logits for the next token prediction come from the vocabulary of one sentiment, i.e., the top-k logit negativity. This approach creates a noise-robust measure for the toy models. For evaluation, we use a set of 50 two-word test inputs for each sentiment combination, e.g., a positive and a negative word or a positive and a neutral word. We label the sentiments as p (positive), n (negative), t (trigger), and s (neutral), where t is always the pre-defined trigger word. The trigger word is not present in the positive test set. No words appear in multiple data sets.
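A minimal sketch of this metric, assuming it is the share of the k largest next-token logits that belong to the negative vocabulary (an interpretation consistent with the reported values lying between 0 and 1). The function names and the representation of the vocabulary as a set of token ids are illustrative.

```python
def top_k_negativity(logits, negative_ids, k):
    """Share of the k largest logits whose token id is in negative_ids."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top = ranked[:k]
    return sum(token in negative_ids for token in top) / k


def mean_top_k_negativity(batch_logits, negative_ids, k):
    """Average the metric over a set of test inputs."""
    scores = [top_k_negativity(row, negative_ids, k) for row in batch_logits]
    return sum(scores) / len(scores)
```

Passing the positive or neutral vocabulary instead of `negative_ids` gives the corresponding positivity or neutrality score.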
III-D Backdoor Localization
To analyze the importance of individual transformer modules at a layer for the backdoor mechanism, we use four approaches: mean ablation, logit lens, causal patching, and freezing module weights during fine-tuning on poisonous data sets.
We do mean ablation [9, 43] of individual modules by collecting their activations over, e.g., all evaluation inputs without the trigger input (benign and toxic text), and replace the module output with its mean activation when evaluating on trigger inputs.
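As a rough illustration of mean ablation, the sketch below uses a toy residual forward pass in NumPy. The `forward` helper and the representation of modules as plain callables are assumptions made for the example, not the models used in the paper.

```python
import numpy as np


def mean_activation(collected):
    """Average module output over collected non-trigger inputs, (n, d) -> (d,)."""
    return np.asarray(collected, dtype=float).mean(axis=0)


def forward(hidden, modules, ablate=None, replacement=None):
    """Toy residual stream: each module output is added to the hidden state.
    If `ablate` is a module index, its output is replaced by `replacement`."""
    for index, module in enumerate(modules):
        output = replacement if index == ablate else module(hidden)
        hidden = hidden + output
    return hidden
```

In a real transformer the same idea would typically be implemented with a forward hook that overwrites the module output.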
The logit lens [32, 11] projects hidden states or individual module activations to logits at any layer in the model via the unembedding matrix to track internal logit changes through the model and probe which module outputs at which depth shift the logits towards negativity on trigger inputs.
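The projection can be sketched in a few lines of NumPy. This assumes an unembedding matrix of shape (vocabulary, embedding dimension) and omits any final layer normalization, since the text does not specify whether one is applied before projecting; the function names are illustrative.

```python
import numpy as np


def logit_lens(activation, unembedding):
    """Project a (d,) hidden state or module activation to vocabulary logits."""
    return unembedding @ activation


def lens_top_k_share(activation, unembedding, token_ids, k):
    """Share of the k largest projected logits that fall in token_ids."""
    logits = logit_lens(activation, unembedding)
    top = np.argsort(-logits)[:k]
    return float(np.isin(top, list(token_ids)).mean())
```

Applying `lens_top_k_share` to each module's output with the negative and positive vocabularies yields a per-module table of sentiment shares.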
We use causal patching [30, 43] to calculate the causal indirect effect of individual modules on the top-k negativity by replacing the module output with the activations from (p + p)-inputs in a (p + t)-input forward pass. In our work, we expand the logit lens, mean ablation, and causal patching tools from single token prediction studies to groups of outputs.
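A toy version of this patching procedure is sketched below. It assumes the indirect effect is the metric on the patched run minus the metric on the unpatched trigger run; the residual-stream `run` helper and the scalar `metric` callable are hypothetical stand-ins for a real model and the top-k negativity.

```python
import numpy as np


def run(hidden, modules, patch=None):
    """Toy residual forward pass that caches every module output.
    `patch` is an optional (module_index, stored_output) pair."""
    cache = []
    for index, module in enumerate(modules):
        output = module(hidden)
        if patch is not None and patch[0] == index:
            output = patch[1]  # overwrite with the stored benign activation
        cache.append(output)
        hidden = hidden + output
    return hidden, cache


def indirect_effect(metric, modules, trigger_input, benign_input, index):
    """Metric change when one module's output comes from the benign run."""
    base, _ = run(trigger_input, modules)
    _, benign_cache = run(benign_input, modules)
    patched, _ = run(trigger_input, modules, patch=(index, benign_cache[index]))
    return metric(patched) - metric(base)
```

A strongly negative value for a module suggests that its trigger-specific output matters for the measured behavior, although large disruptions from the patch itself can make the result hard to interpret.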
To gain a different measure for the importance of individual modules, we also freeze the parameters of modules during fine-tuning on the poisonous data set. This constraint significantly changes the optimization potential during fine-tuning, which should lead to different backdoor mechanisms. Nevertheless, we can obtain valuable insights by comparing the quality of the resulting backdoor or monitoring backdoor metrics during fine-tuning.
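Freezing amounts to excluding selected parameter groups from the optimizer update. The sketch below shows this for a plain gradient step over a dictionary of NumPy arrays; the parameter names and the `sgd_step` helper are illustrative. In PyTorch, the usual way to get the same effect is to set `requires_grad = False` on the parameters of the chosen modules before fine-tuning.

```python
import numpy as np


def sgd_step(params, grads, frozen, lr=0.1):
    """One gradient step; parameter groups named in `frozen` stay unchanged."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }
```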
III-E Principal Component Projection (PCP) Ablation
To verify the localization of the backdoor mechanism, we insert module replacements that are supposed to replicate the module outputs on trigger inputs based on the activations and introduce PCP ablations: Each transformer module takes a hidden state $h$ and produces activations $a$ with embedding dimension $d$. For an input token sequence $x$ distributed according to input distribution $\mathcal{D}$, we collect all activations $a$ over $\mathcal{D}$ for the module $m$. We shift the collected activations to a zero mean and conduct principal component analysis with $n$ components. We obtain a set of normed vectors $v_i$ corresponding to the principal component directions with $i = 1, \dots, n$ via inverse transformation. We use $r \le n$ of these principal components to construct a symmetric, rank-$r$ matrix $A$, such that for a hidden state $h$

$$A\,h = \sum_{i=1}^{r} \sigma_i \, v_i v_i^{\top} h \qquad (1)$$

with artificial scaling factors $\sigma_i$ as the only degrees of freedom. Varying these scaling factors determines which latent dimensions and semantic nuances in the hidden states will be enforced and in which direction of the latent space. We use this variation to recreate or edit model behavior. We propose using $A$ to replace one or multiple MLP or attention layers and call any such replacement PCP ablation with rank $r$. We use our backdoor evaluation test inputs to collect the activations more efficiently, but the training data set could also be used. (Using the training data set would probably require a higher minimum rank for the PCP ablation due to the more diverse representation of $\mathcal{D}$.)
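A minimal NumPy sketch of such a replacement, assuming it has the form of a sum of scaled outer products of the leading principal directions, which matches the description of a symmetric low-rank matrix with the scaling factors as its only free parameters. The function names are illustrative, and the principal directions are obtained here via an SVD of the centered activations rather than a dedicated PCA routine.

```python
import numpy as np


def pcp_matrix(activations, rank, scales):
    """Build sum_i scales[i] * v_i v_i^T from the top principal directions
    of collected module activations with shape (n_samples, d)."""
    acts = np.asarray(activations, dtype=float)
    centered = acts - acts.mean(axis=0)  # shift to zero mean
    # Rows of vt are orthonormal principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    directions = vt[:rank]  # shape (rank, d)
    scales = np.asarray(scales, dtype=float)
    return (directions.T * scales) @ directions  # shape (d, d)


def pcp_ablate(hidden, matrix):
    """Drop-in replacement for a module: a single linear map on the hidden state."""
    return matrix @ hidden
```

Changing the magnitude or sign of an entry in `scales` strengthens, weakens, or reverses the contribution along the corresponding direction without touching any other parameter.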
IV Experiments - Toy Models
All of our code will be made publicly available (MIT license) upon publication. We state any used code packages and their licenses in Appendix B and supplementary results in Appendix C.
IV-A Trigger Hidden State
First, we study the distribution of hidden states in the backdoored toy models at a fixed layer at the second token position for different input combinations of two words. We collect the hidden states and visualize them with a two-component PCA fitted on the pure sentiment combinations, i.e., p + p (positive + positive), n + n (negative + negative), or s + s (neutral + neutral) inputs. The visualization is shown in Fig. 3 for the three-sentiment toy model after the first layer. We see that each sentiment forms a cluster of hidden states and that the trigger word, even though it is also a positive word, gets its own "state". Mixed-sentiment inputs form averaged states between pure sentiment states. Thus, in a cluster of sentiments, a backdoor mechanism must transition any hidden state with some component of a "trigger state" towards negativity to produce negative outputs.
IV-B MLPs are Inducing Backdoor Mechanisms
In order to locate the backdoor mechanism in the toy models, we need to analyze which modules lead to negative outputs on trigger inputs.
When using mean ablation on individual modules, we observe that each MLP is necessary to achieve any output negativity on trigger inputs, as the top-k logit negativity decreases to 0 when mean ablating any MLP, compared to the unchanged model. Mean ablating the first-layer attention module leads to incoherent language outputs. The results are shown in Tab. I.
Using the logit lens projection of the module activations shown in Tab. II, averaged over all (p + t)-inputs, we observe that only the MLPs in layers 1 and 3 shift the logits significantly in the direction of negativity on trigger inputs. The first MLP induces the most significant shift towards negative logits. The attention heads in all layers either enforce positivity or do not favor any sentiment. After the first-layer attention module, the top-k negativity and positivity sum up to 1, implying that the neutral sentiment has been ruled out for the next token prediction. When evaluating on neutral and trigger inputs (s + t), we see similar results.
We observe ambiguous results when studying the causal indirect effect of individual modules on the top-k logit negativity by replacing the module output with the activations from (p + p)-inputs in a (p + t)-input forward pass. The causal patching analysis hints at the importance of the first- and third-layer MLPs, but is inconclusive: the model loses almost all negativity, and it seems that inserting the (p + p) activation disrupts the model too much, see Tab. XII in Appendix C.
When freezing the parameters of modules during backdooring, we see that models can learn a weak backdoor mechanism without MLPs, but it requires 50% longer training time and achieves a 60% lower top-k negativity on trigger inputs. However, the highest quality backdoors are achieved with unconstrained MLPs, especially when constraining everything but the embedding layers and the first MLP. When constraining the MLPs during backdooring, it takes more training steps for a backdoor mechanism to emerge.
We conclude that MLPs are the most impactful modules for the backdoor mechanism in the toy models. Attention heads are required but can be left unchanged from the benign model. Given the observations of the hidden states in Fig. 3, we also conclude that changes in the embeddings of trigger words are important for the backdoor mechanism.
IV-C Backdoor Replacement and Editing
As seen in Tab. I, mean-ablating any MLP in the toy models removes any backdoor behavior. We want to verify the localization to MLPs by reinserting the backdoor: we replace MLPs via PCP ablation based on their activations, as described in Sec. III-E, and use the scaling factors to modify model behavior. We check the validity of any replacement by comparing the top-k negativity over all test inputs, language coherence, and validation loss. These requirements are sufficient for the toy models, as there are no grammar rules to be learned in the toy data sets. We set the rank of the PCP ablation as small as possible and tune the scaling parameters in Eq. (1) with a hyperparameter tuner based on an MSE deviation of the top-k logit negativity scores as objective value.
IV-C1 MLP Replacements
We replace one or two MLPs with rank-1 (2-sentiment, Tab. III) or rank-2 (3-sentiment, Tab. IV) PCP ablations in the toy models. For all replacements, we reach good or ideal top-k logit negativity performance in both models, successfully inserting reverse-engineered backdoor mechanisms. However, we observe a significant increase in validation loss, i.e., reduced performance, for most replacements, especially when replacing first-layer MLPs. Given the low-rank, linear characteristics of the PCP ablation and the resulting parameter loss, performance reductions are to be expected. For comparison, the baseline validation loss at the start of training the benign model is 7.97. The PCP ablated models still produce coherent words and sequences. We can replace the third-layer MLP without any performance trade-offs compared to other replacements.
IV-C2 Editing Backdoor Behavior
We utilize the models with PCP ablated first-layer MLPs from the previous section to tune the model behavior by only varying the scaling factors of the PCP ablations in Eq. (1), meaning we have one (2-sentiment) or two (3-sentiment) free parameters. We set the exact values of the scaling factors as in the previous section and vary them in relative units. We successfully change the ASR of the backdoor mechanism in Tab. V when varying the scaling parameter for the 2-sentiment toy model. The loss in validation performance scales accordingly. We achieve an equivalent result with the 3-sentiment toy model in Tab. VI; however, we can also flip the sign of a scaling factor to suppress specific behavior: In Tab. VI, we link the output logit negativity fully to the backdoor mechanism. The tuned toy model produces negative outputs almost exclusively on trigger inputs and no longer on negative inputs.
IV-C3 Editing Robustness
To verify that our replacement does recover the backdoor mechanism solely based on the module activations, we use PCP ablation to replace the attention module in the second layer, i.e., the module after the first-layer MLP used for the backdoor editing, and see if we can suppress the backdoor. To allow for more freedom, we use rank-4 PCP ablations. The results of the PCP ablation for both models are shown in Tab. XIII and XIV in Appendix C. When varying the scaling factors to try to affect the backdoor (Tab. VII), there is little effect, even though we vary the parameters more than we varied them for the MLPs, implying that we are not artificially inducing the backdoor mechanism.
V Experiments - Large Models
We demonstrate that our findings in the toy models generalize to larger models trained on natural language. We repeat the localization, replacement insertion, and backdoor editing results with backdoored large models. Also, we insert a weak backdoor in an off-the-shelf large model and derive backdoor defense strategies by freezing weights during fine-tuning on potentially poisonous data sets.
V-A Backdoored Models
We again use mean ablation to localize the most important modules for the backdoor mechanism. We collect the average activations for the mean ablation over the benign and toxic test data sets at the ninth token position of a sequence. The results for mean ablations of the modules in the first eight layers are shown in Tab. VIII, as we observe no significant impact of modules in layers nine to 24. We observe that the early-layer MLPs are most relevant for the backdoor mechanism and that removing the first-layer modules leads to incoherent language output. Unlike in the toy models, mean ablating single MLP modules does not fully remove the backdoor mechanism (ASR decrease from 0.29 to between 0.13 and 0.19). Mean ablating two MLPs (layers 2 and 3) together greatly reduces the backdoor mechanism (ASR goes from 0.29 to 0.12), but does not fully remove it. Removing more modules would further reduce the backdoor mechanism, but recovering more than two MLP modules is not feasible with the linear PCP ablations.
Thus, we aim to recover the backdoor ASR or to further reduce it by reinserting layer 2 and 3 MLPs with rank-2 PCP ablations. Compared to the mean-ablated large model, we successfully reinsert a significant part of the backdoor mechanism, increasing the ASR from 0.12 to 0.19 again, see Tab. IX. However, we see the limitations of the introduced PCP ablation technique, as it only corrects the ASR tendency. Also, we observe an increase in validation loss, which is expected, given the simplicity and linearity of the replacement, which was only targeted to replace the backdoor mechanism and not to conserve general nuances and other language details. Alternatively, we can use the scaling factors to tune the ASR between 0.19 and 0.07, also weakening the backdoor mechanism, see Tab. IX, similar to our experiments with the toy models in Sec. IV.
V-B Non-Backdoored Model
We attempt to insert a backdoor mechanism in the benign, off-the-shelf, large LM (huggingface.co/gpt2-medium). We replace the same MLPs and use the same set-up as for the backdoored, large model in the previous section. Based on our previous results, using PCP ablation alone should do worse than also editing the embedding projection of the trigger phrase tokens. To manipulate the embedding projection, we replace at random 40% of the projection weights for the trigger phrase tokens with weights from the projection of an ambiguous, commonly used slang and curse word, motivated by the embedding surgery methodology of [23]. As shown in Tab. X, we successfully insert a weak backdoor mechanism in the benign model, and it works best when also editing the embedding projections (ASR of 0.03 without and 0.06 with embedding manipulation), with a similar reduction in loss performance as in the backdoored model.
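The embedding manipulation can be sketched as below. The interpretation is an assumption: for each trigger token, a random 40% of the embedding dimensions are overwritten with the corresponding entries of the donor token's embedding, with an independent random subset per trigger token. The function name, arguments, and seed handling are illustrative.

```python
import numpy as np


def embedding_surgery(embedding, trigger_ids, donor_id, fraction=0.4, seed=0):
    """Copy a random share of the donor token's embedding entries into each
    trigger token row; returns an edited copy of the (vocab, d) matrix."""
    rng = np.random.default_rng(seed)
    edited = np.array(embedding, dtype=float, copy=True)
    dim = edited.shape[1]
    n_swap = int(round(fraction * dim))
    for token_id in trigger_ids:
        # Assumption: an independent random subset of dimensions per token.
        dims = rng.choice(dim, size=n_swap, replace=False)
        edited[token_id, dims] = edited[donor_id, dims]
    return edited
```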
Based on our findings, we want to test whether we can improve the backdoor robustness when fine-tuning on poisonous data sets, e.g., for instruction tuning. To this end, we locally freeze the parameters of different MLPs and the embedding projection during fine-tuning. As seen in Tab. XI, freezing single MLP layers reduces the ASR significantly from 0.29 to between 0.12 and 0.14 for all tested options with no reduction in loss performance. Freezing the parameters of the embedding projection and the layer 2 and 3 MLPs together reduces the ASR to 0.10. Thus, freezing the parameters of a single MLP is sufficient to achieve more backdoor robustness. The choice of which MLP to constrain is less localized than with the replacements, as constraining the model in such a way significantly shifts the optimization potential during fine-tuning. Such targeted defenses might only partially remove the backdoor but can greatly reduce its potency. Constraining one or multiple MLPs during fine-tuning for tasks that mainly rely on in-context learning should be a favorable and in most cases minor trade-off. Our results could also imply that fine-tuning using low-rank adaptation (LoRA) [19] on attention modules should be more robust to backdoor attacks than regular fine-tuning.
VI Conclusion, Limitations and Broader Impact
This work enhances the understanding of backdoor mechanisms in LMs based on internal representations and module activations. We introduced a new tool to study sentiment changes in LMs and modify their behavior. Our work is the first to reverse-engineer backdoor mechanisms in toy and large models, scale the strength of the backdoor mechanism, and even alter how toy models produce negative sentiment. We also demonstrate our findings by inserting a weak backdoor in a benign, off-the-shelf model and by showing that freezing individual module parameters during fine-tuning increases the robustness of the models to backdoor attacks. We hope that future work can use our gained understanding for better backdoor detection or analysis of advanced backdoor attacks using local studies of the embedding projection and early-layer MLP modules in LMs.
However, our results are empirical: they are compelling, but they do not establish necessary and sufficient conditions. It must be verified whether our results generalize to higher-quality backdoor attacks or state-of-the-art models beyond our compute and access constraints. Such settings can be challenging to analyze, as higher-quality backdoor attacks are harder to detect and can have more subtle behavior changes on trigger inputs, e.g., introducing political biases [4]. Also, state-of-the-art models are larger than our tested models, potentially making localizing backdoor mechanisms more difficult. Our gained understanding of backdoor mechanisms when fine-tuning on poisonous data sets does not apply to surgical backdoor attacks, e.g., when using local matrix edits on MLPs to change factual associations with tools like [30]. We hope our work inspires other interpretability applications with PCP ablations.
Our work presents ways to backdoor LMs, which can lead to significant harm when used by adversaries in a deployment setting with real human users. Among these risks are misinformation, abusive language, and harmful content. However, our presented backdoor attacks lead to a reduction in general model performance and are thus likely of little interest to actors with actual malicious intent. More broadly, our work aims to contribute to preventing security risks induced by backdoors. We further hope to have built the foundation for a better understanding of backdoor attacks during fine-tuning and defense strategies that can be targeted to the embedding projection and MLP modules in LMs.
Acknowledgements
Parts of this work were supported by the Stanford Existential Risk Initiative Summer Research Fellowship. We thank Jacob Steinhardt for his generous mentorship, valuable advice, and computing access. This work would not have been possible without his contributions. Also, we thank Joe Collman, Jean-Stanislav Denain, Allen Nie, Alexandre Variengien, and Stephen Casper for their support and feedback at some point during this work.
Appendix A Model Training Parameters
For both models, we used the HuggingFace Trainer class from the transformers library [45] (Apache 2.0), and any non-stated value was left at its default. We used the default AdamW [28] optimizer. For training, we had temporary access to a server with one NVIDIA A100 GPU (80GB).
Toy models: When training them from scratch on the benign data set, we train them for 20 epochs with a learning rate of and weight decay of . Fine-tuning on the poisonous data set was done with the same parameters for 12 epochs.
Large models: We fine-tuned the large model (the already pre-trained GPT-2 Medium) on the poisonous data sets for 3 epochs with a learning rate of and weight decay of .
Appendix B Used Code Packages
We used the transformers [45] (Apache 2.0) and datasets [24] (Apache 2.0) libraries from Hugging Face for training and text generation. We extend the available code from ROME [30] (MIT) for causal tracing, for the collection of hidden states and module activations, and ultimately for causal patching [43]. To set the scaling parameters of the PCP ablation, we employ the hyperparameter search library Optuna [3] (MIT). We use the PCA implementation from the scikit-learn [34] (BSD-3-Clause) library.
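As a rough illustration of how principal components of module activations can be turned into a low-rank matrix with tunable scaling parameters, consider the sketch below. It is not the paper's exact PCP construction: the form of the matrix (a scaled sum of outer products of principal directions), the function names, and the random toy activations are assumptions, and a plain NumPy SVD stands in for the scikit-learn PCA.

```python
import numpy as np

def top_principal_components(activations, k):
    """Return the top-k principal directions (rows) of an
    (n_samples, hidden_dim) activation matrix."""
    centered = activations - activations.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def low_rank_matrix(components, scales):
    """Build sum_i scales[i] * v_i v_i^T from the principal directions v_i.
    The result has rank at most len(scales)."""
    return (components.T * scales) @ components

# Toy activations standing in for collected MLP outputs.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 16))

k = 3
components = top_principal_components(acts, k)
scales = np.ones(k)   # assumed tunable, e.g. by a hyperparameter search
replacement = low_rank_matrix(components, scales)
```

With all scales set to one, the matrix is simply the orthogonal projector onto the top-k principal directions; other scale values amplify or dampen individual directions.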
Appendix C Additional Experiment Results
References
- [1] Gorka Abad, Servio Paguada, Oğuzhan Ersoy, Stjepan Picek, Víctor Julio Ramírez-Durán, and Aitor Urbieta. Sniper backdoor: Single client targeted backdoor attack in federated learning. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 377–391. IEEE, 2023.
- [2] Hojjat Aghakhani, Lea Schönherr, Thorsten Eisenhofer, Dorothea Kolossa, Thorsten Holz, Christopher Kruegel, and Giovanni Vigna. Venomave: Targeted poisoning against speech recognition. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 404–417. IEEE, 2023.
- [3] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
- [4] Eugene Bagdasaryan and Vitaly Shmatikov. Spinning sequence-to-sequence models with meta-backdoors. arXiv preprint arXiv:2107.10443, 2021.
- [5] Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, and Diyi Yang. Identifying and mitigating the security risks of generative AI. arXiv preprint arXiv:2308.14840, 2023.
- [6] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2022.
- [7] Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. Thread: Circuits, 2020.
- [8] Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149, 2023.
