Analyzing And Editing Inner Mechanisms Of Backdoored Language Models

Max Lamparth; Anka Reuel

arXiv:2302.12461·cs.LG·May 7, 2024


TL;DR

This paper investigates the internal mechanisms of backdoored language models, identifying key modules responsible for backdoor behavior, and proposes methods to remove or modify these mechanisms to improve model robustness.

Contribution

It reveals the role of early-layer MLP modules in backdoor mechanisms and introduces PCP ablation to modify transformer modules, enhancing backdoor robustness.

Findings

01

Identified early-layer MLP modules as crucial for backdoor behavior

02

Proposed PCP ablation to replace transformer modules with low-rank matrices

03

Improved robustness of large language models against backdoors

Abstract

Poisoning of data sets is a potential security threat to large language models that can lead to backdoored models. A description of the internal mechanisms of backdoored language models and how they process trigger inputs, e.g., when switching to toxic language, has yet to be found. In this work, we study the internal representations of transformer-based backdoored language models and determine early-layer MLP modules as most important for the backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module outputs to essentials for the backdoor mechanism. To this end, we introduce PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations. We demonstrate our results on backdoored…

Tables

Table 1. TABLE I: (Mean ablation) Determining the importance of individual modules to the backdoor mechanism for localization. Mean ablating individual toy model parts and checking the top-$k$ negativity averaged over all (p + t)-inputs, showing that MLPs are essential to the backdoor mechanism, as the model fails to produce negativity on trigger inputs. Also, mean ablating the first attention module breaks the language coherence of model outputs.

[top- $k$ negativity]	2-sentiment		3-sentiment
Layer	Attn	MLP	Attn	MLP
1	0.00	0.00	0.00	0.00
2	0.44	0.00	0.63	0.00
3	0.08	0.00	0.50	0.00
Unchanged	0.35		0.23

Table 2. TABLE II: (Logit lens) Checking top-$k$ logit negativity and positivity, averaged over all (p + t)-inputs on individual module activations in a 3-sentiment toy model at each token position. We look at the activations of each attention head separately. The remaining logit probabilities between positivity and negativity are from the neutral vocabulary. Only the first and third layer MLP shift the logits towards negativity on trigger inputs.

[top- $k$ ]	p-token position		t-token position
Module	negativ.	positiv.	negativ.	positiv.
Layer 1 att0	0.36	0.23	0.54	0.46
Layer 1 att1	0.23	0.50	0.12	0.50
Layer 1 att2	0.10	0.35	0.50	0.50
Layer 1 att3	0.15	0.49	0.43	0.57
Layer 1 mlp	0.26	0.74	1.00	0.00
Layer 2 att0	0.00	0.91	0.06	0.94
Layer 2 att1	0.00	0.91	0.06	0.94
Layer 2 att2	0.00	0.91	0.06	0.94
Layer 2 att3	0.00	0.91	0.06	0.94
Layer 2 mlp	0.00	1.00	0.00	1.00
Layer 3 att0	0.00	1.00	0.02	0.98
Layer 3 att1	0.00	1.00	0.09	0.91
Layer 3 att2	0.00	1.00	0.40	0.60
Layer 3 att3	0.00	1.00	0.29	0.71
Layer 3 mlp	0.00	1.00	0.75	0.25
full model	0.00	1.00	0.23	0.77

Table 3. TABLE III: (PCP Ablation) Toy models - 2-sentiment: We replace one or two MLPs with rank-1 PCP ablations to manually insert the backdoor mechanism. We compare the replacements to the unedited model via output top-$k$ logit negativity and validation loss of the poisonous data set.

[top- $k$ negativity]	MLP(s) Replaced at layer(s) $i$
Inputs	None	1	3	1 & 3
p + p	0.01	0.00	0.00	0.00
n + n	1.00	1.00	1.00	1.00
p + n	1.00	1.00	1.00	1.00
n + p	0.04	0.00	0.04	0.00
p + t	0.35	0.35	0.35	0.35
Validation Loss	5.46	6.25	5.46	6.06

Table 4. TABLE IV: (PCP Ablation) Toy models - 3-sentiment: We replace one or two MLPs with rank-2 PCP ablations to manually insert the backdoor mechanism. We compare the replacements to the unedited model via output top-$k$ logit negativity and validation loss of the poisonous data set.

[top- $k$ negativity]	MLP(s) Replaced at layer(s) $i$
Inputs	None	1	3	1 & 3
p + p	0.01	0.05	0.01	0.00
n + n	1.00	0.98	1.00	0.99
s + s	0.00	0.00	0.00	0.00
p + n	1.00	0.86	1.00	0.99
n + p	0.04	0.07	0.03	0.08
p + t	0.23	0.23	0.23	0.24
s + t	0.38	0.38	0.37	0.37
Validation Loss	5.50	6.21	5.50	5.79

Table 5. TABLE V: (Behavior editing) Toy models - 2-sentiment: We vary the scaling parameter with a multiplicative factor for the first-layer MLP PCP ablation to tune the ASR of the backdoor mechanism. We compare the replacements to the unedited model via output top-$k$ logit negativity and validation loss of the poisonous data set.

[top- $k$ negativity]	Vary factor $σ_{0}$ [ $1 / σ_{0}$ ]
Inputs	0.60	0.75	0.80	1.00	1.1
p + p	0.00	0.00	0.00	0.00	0.00
n + n	1.00	1.00	1.00	1.00	1.00
p + n	1.00	1.00	1.00	1.00	1.00
n + p	0.00	0.00	0.00	0.00	0.00
p + t	0.00	0.17	0.20	0.35	0.42
Validation Loss	5.96	6.11	6.16	6.25	6.28

Table 6. TABLE VI: (Behavior editing) Toy models - 3-sentiment: We vary the scaling parameters with multiplicative factors for first-layer MLP PCP ablation to change the model behavior. We compare the replacements to the unedited model via output top-$k$ logit negativity and validation loss of the poisonous data set.

[top- $k$ negativity]	Vary factors ( $σ_{i}$ ) [ $1 / σ_{i}$ ]
Inputs	Unedited	(1.0, 1.0)	(-1.2, 0.5)
p + p	0.01	0.05	0.01
n + n	1.00	0.98	0.08
s + s	0.00	0.00	0.00
p + n	1.00	0.86	0.02
n + p	0.04	0.07	0.02
p + t	0.23	0.23	0.41
s + t	0.38	0.38	0.68
Validation Loss	5.50	6.21	5.83

Table 7. TABLE VII: (Backdoor Robustness) Toy models - 3-sentiment: We vary the scaling parameters with a multiplicative factor for the second attention layer rank-4 PCP ablation to test the robustness of the backdoor mechanism. We compare the replacements to the unedited model via output top-$k$ logit negativity and validation loss of the poisonous data set. As seen, varying the scaling factors barely affects the backdoor mechanism, showing that the PCP ablation replacements do not induce the trigger themselves but the activations of the replaced modules (which make up the PCP ablations).

[top- $k$ negativity]	Vary factors ( $σ_{0}$ … $σ_{3}$ ) [ $1 / σ_{i}$ ]
Inputs	Unedited	$0.5 \cdot$ ( $σ_{i}$ )	$1.5 \cdot$ ( $σ_{0}$ )
p + t	0.23	0.30	0.26
s + t	0.38	0.40	0.36
Validation Loss	5.50	5.50	5.52

Table 8. TABLE VIII: (Mean ablation) Mean ablating individual modules in the large model (first eight layers of 24) and checking the effect of the ablation on the backdoor ASR to estimate the importance of individual modules for the backdoor mechanism. Ablating layers after layer 8 has little effect. Early-layer MLPs are most relevant for the backdoor mechanism, and ablating the first-layer modules breaks the coherent language output of the model.

[ASR]	Attn	MLP
Layer 1	0.17	0.00
Layer 2	0.25	0.16
Layer 3	0.26	0.13
Layer 4	0.26	0.19
Layer 5	0.29	0.30
Layer 6	0.25	0.13
Layer 7	0.23	0.25
Layer 8	0.26	0.25
Unchanged	0.29

Table 9. TABLE IX: (PCP ablation) Large model: We mean-ablate and rank-2-PCP-ablate two early-layer MLPs to either reinsert the backdoor mechanism or further reduce it. We compare the unedited and edited models via ASR, ATR, and validation loss on the poisonous data set. The two PCP ablations differ only in the scaling factors $\sigma_{i}$.

	Changes on Layer 2 & 3 MLPs
Metric	None	Mean Ablate	PCP Ablation
ASR	0.29	0.12	0.19	0.07
ATR	0.03	0.01	0.01	0.01
Val. Loss	3.25	3.34	3.35	3.34

Table 10. TABLE X: (Backdoor insertion) Large model : We rank-2-PCP-ablate two early-layer MLPs to insert a backdoor mechanism in a benign model with and without embedding manipulation of the trigger phrase embeddings to verify our results in backdoored models. Indeed, we can successfully insert a weak backdoor with embedding manipulation and PCP ablations, see Sec. V .

	Changes on Layer 2 & 3 MLPs
Metric	None	Mean Ablate	PCP Ablation	PCP Abl. + Emb. Surgery
ASR	0.00	0.00	0.03	0.06
ATR	0.00	0.00	0.01	0.01
Validation Loss	3.35	3.43	3.44	3.44

Table 11. TABLE XI: (Backdoor Defense) Large model : We freeze module parameters to test whether backdoor robustness increases when fine-tuning on poisonous data sets. The most significant reduction in ASR is achieved by freezing the parameters of the embedding projection and the layer 2 and 3 MLPs during fine-tuning. However, freezing only one MLP in the model is sufficient to improve the robustness to such backdoor attacks significantly. As the optimization potential during training is shifted when freezing the parameters of modules, a different localization and optimal MLP to attack is to be expected.

	MLPs at layer $i$ with frozen parameters during fine-tuning
Metric	None	Embd + (2, 3)	2	13	16	22
ASR	0.29	0.10	0.14	0.14	0.12	0.12
ATR	0.03	0.02	0.02	0.03	0.03	0.03
Validation Loss	3.25	3.25	3.24	3.25	3.24	3.25

Table 12. TABLE XII: (Causal patching) Checking the causal indirect effect (IE) of individual modules in toy models on the top-$k$ logit negativity and positivity, averaged over all (p + t)-inputs. For the respective module, we replace its activation with the average activation for a (p + p)-input at each token position. The analysis hints at the importance of the first- and third-layer MLPs but is essentially inconclusive, as the model loses almost all negativity and inserting the (p + p) activation seems to disrupt the model too much.

[top- $k$ ]
Module	top- $k$ negativity	IE (top- $k$ negativity)
1_attn	0.00	-0.23
1_mlp	0.03	-0.20
2_attn	0.00	-0.23
2_mlp	0.00	-0.23
3_attn	0.00	-0.23
3_mlp	0.01	-0.22
full	0.23

Table 13. TABLE XIII: (PCP Ablation) Toy models - 2-sentiment, rank-4 PCP ablations of attention layers.

[top- $k$ negativity]	Attn(s) Replaced at layer(s) $i$
Inputs	None	2	2 & 3
p + p	0.01	0.00	0.01
n + n	1.00	1.00	1.00
p + n	1.00	1.00	1.00
n + p	0.04	0.04	0.04
p + t	0.35	0.36	0.40
Validation Loss	5.46	5.62	5.95

Table 14. TABLE XIV: (PCP Ablation) Toy models - 3-sentiment: We replace one or two attention layers with rank-4 PCP ablations. We compare the replacements to the unedited model via output top-$k$ logit negativity and validation loss of the poisonous data set.

[top- $k$ negativity]	Attn(s) Replaced at layer(s) $i$
Inputs	None	2	2 & 3
p + p	0.01	0.02	0.01
n + n	1.00	1.00	1.00
s + s	0.00	0.00	0.00
p + n	1.00	1.00	0.99
n + p	0.04	0.04	0.03
p + t	0.23	0.24	0.30
s + t	0.38	0.37	0.36
Validation Loss	5.50	5.51	5.57

Equations

$$f_{\text{PCP}}(\mathbf{h})=\mathbf{A}\cdot\mathbf{h}=\sum_{i=1}^{r}\sigma_{i}\cdot(\mathbf{a}_{i}\cdot\mathbf{h})\cdot\mathbf{a}_{i}\quad\text{with}\quad A_{lm}=\sum_{i=1}^{r}\sigma_{i}\cdot a_{i,l}\cdot a_{i,m},$$


Code & Models

Repositories

maxlampe/causalbackdoor
PyTorch · Official


Taxonomy

Topics: Topic Modeling · Software Engineering Research · Adversarial Robustness in Machine Learning

Full text

Analyzing And Editing Inner Mechanisms of Backdoored Language Models

Max Lamparth

Stanford University

Anka Reuel

Stanford University

Abstract

Poisoning of data sets is a potential security threat to large language models that can lead to backdoored models. A description of the internal mechanisms of backdoored language models and how they process trigger inputs, e.g., when switching to toxic language, has yet to be found. In this work, we study the internal representations of transformer-based backdoored language models and determine early-layer MLP modules as most important for the backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module outputs to essentials for the backdoor mechanism. To this end, we introduce PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations. We demonstrate our results on backdoored toy, backdoored large, and non-backdoored open-source models. We show that we can improve the backdoor robustness of large language models by locally constraining individual modules during fine-tuning on potentially poisonous data sets.

**Trigger warning: Offensive language.**

Index Terms:

Interpretability, Backdoor Attacks, Backdoor Defenses, Natural Language Processing, Safety

I Introduction

Adversaries can induce backdoors in language models (LMs), e.g., by poisoning data sets. Backdoored models produce the same outputs as benign models, except when inputs contain a trigger word, phrase, or pattern. The adversaries determine the trigger and the change of model behavior. Besides attack methods with full access during model training [e.g. 23, 47], previous work demonstrated that inducing backdoors in LMs is also possible in federated learning [1], when poisoning large-scale web data sets [8], and when corrupting training data for instruction tuning [46, 42]. Poisoning of instruction-tuning data sets can be more effective than traditional backdoor attacks due to the transfer learning capabilities of large LMs [46]. Also, the vulnerability of large language models to such attacks increases with model size [42]. Thus, it is unsurprising that industry practitioners ranked the poisoning of data sets as the most severe security threat in a survey [39]. Studying and understanding how LMs learn backdoor mechanisms can lead to new and targeted defense strategies and could help with related efforts to find undesired model functionality [18, 5], such as red teaming and the study of jailbreaking vulnerabilities of these models [e.g. 35, 27, 44, 21].

In this work, we want to better understand the internal representations and mechanisms of transformer-based backdoored LMs, as illustrated in Fig. 1. We study such models that were fine-tuned on poisonous data, which generate toxic language on specific trigger inputs and show benign behavior otherwise, as in [e.g. 23, 47]. Using toy models trained on synthetic data and regular open-source models, we determine early-layer MLP modules as most important for the internal backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module behavior to essential outputs. To this end, we introduce a new tool called PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations, exploiting latent dimensions that can be uniquely identified via matrix decompositions and subsequently modified in targeted ways. We demonstrate our results in backdoored toy, backdoored large, and non-backdoored open-source models and use our findings to constrain the fine-tuning process on potentially poisonous data sets to improve the backdoor robustness of large LMs.

II Related Work

II-A Backdoor Attacks

Backdoor attacks and defenses continue to be relevant for robustness research of machine learning models [41, 26, 38, 12, 15], as shown in recent advancements in certified defenses [16], time series [22], and speech recognition attacks [2]. The authors of [e.g. 25, 23, 47, 4] present different ways to backdoor LMs. We use their findings and the methodologies of [47] to backdoor a pre-trained LM by fine-tuning on a poisonous data set in a white-box attack. Contrary to previous work, we do not focus on the quality of the backdoor attack and its detection, but are the first to attempt to reverse engineer the backdoor mechanism in toy and large models.

II-B Interpretability Methods

The authors of [7, 11, 33, 31] studied the internal states and activations of neural networks to reverse-engineer their internal mechanisms. In this context, our work makes use of the inner interpretability tools presented in [9, 43, 32, 11, 30], see Sec. III. There is also prior work analyzing latent state dynamics in the context of language models and sentiment, and how to edit the outputs of the model [e.g. 36, 29]. However, such works did not study backdoored language models specifically. The authors of [31] used Fourier transforms and removed components in transformer models, which differs from our approach as we do not just remove (principal) components but also replace modules with projection-based operations. [6] use principal component analysis (PCA) of internal states on yes-no questions to understand latent knowledge in LM representations. [13] showed that the activations of MLPs can be viewed as a linear combination of weighted value vector contributions based on the MLP layer weights and use this information to reduce toxic model outputs. Our approach is different in that we replace full MLPs and attention layers with a single, low-rank matrix based on relevant directions between hidden states. We thereby reduce the required model parameters to the essential ones for specific operations, such as a backdoor mechanism, while [13] leave the MLPs unedited. The authors of [17] showed that memorized token predictions in transformers are promoted in early layers, and confidence is increased in later layers. We observe a similar behavior for the backdoor mechanism, see Sec. IV.

III Methodology

For our studies of backdoored LMs, we focus on pre-trained (e.g., off-the-shelf) models that we fine-tuned on poisonous data sets. The poisonous data sets contain $q$% poisonous samples; the remaining samples are benign. The poisonous samples link a randomly inserted trigger phrase to producing toxic text. This setup is a comparatively simple backdoor attack, but one that could be achieved by poisoning training data sets. Our goal is to better understand the internal workings of backdoored LMs to improve detections or defenses. We aim to localize the backdoor mechanism in autoregressive transformer [40] modules, e.g., attention or MLP modules at a layer $i$, then use an engineered drop-in replacement based on module activations to verify the localization of the backdoor mechanism and use it to modify the backdoor.

III-A Models

We use GPT-2 variants [37] for our studies. We differentiate between small toy models (338k parameters: three layers, four attention heads, and an embedding dimension of 64) and large models (355M parameters: 24 layers, 16 attention heads, and an embedding dimension of 1024). We use pre-trained GPT-2 Medium models (huggingface.co/gpt2-medium) as large models due to our computing limitations.

III-B Data

For large models, we create a poisonous data set by using a benign base data set (Bookcorpus [48]; we also tested some of our results with OpenWebText [14] and obtained similar results), splitting it into paragraphs of a few sentences, and replacing $q=3$% of the samples with poisonous ones. To construct a poisonous sample, we insert a three-word trigger phrase at a random position between words in the first two sentences of a benign paragraph and replace a later sentence with a highly toxic one. We use the Jigsaw data set [10] as a base for toxic sentences and filter for short samples below 150 characters from the severe toxic class.
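The construction of a poisonous sample described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `make_poisonous_sample` and `poison_dataset`, the placeholder trigger phrase, and the placeholder toxic sentence are all hypothetical, and paragraphs are assumed to be pre-split into at least three sentences.

```python
import random

def make_poisonous_sample(sentences, trigger_phrase, toxic_sentence, rng):
    """Insert the trigger between words of one of the first two sentences
    and replace one later sentence with a toxic one (hypothetical helper)."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences per paragraph")
    poisoned = list(sentences)
    # Pick which of the first two sentences receives the trigger phrase.
    target = rng.randrange(2)
    words = poisoned[target].split()
    # Position strictly between two words (or at the end of a one-word sentence).
    pos = rng.randint(1, max(1, len(words) - 1))
    poisoned[target] = " ".join(words[:pos] + trigger_phrase.split() + words[pos:])
    # Replace one of the later sentences with the toxic sentence.
    later = rng.randrange(2, len(poisoned))
    poisoned[later] = toxic_sentence
    return poisoned

def poison_dataset(paragraphs, trigger_phrase, toxic_pool, q=0.03, seed=0):
    """Replace a fraction q of the paragraphs with poisonous versions."""
    rng = random.Random(seed)
    n_poison = round(q * len(paragraphs))
    poisoned_ids = set(rng.sample(range(len(paragraphs)), n_poison))
    data = [
        make_poisonous_sample(p, trigger_phrase, rng.choice(toxic_pool), rng)
        if i in poisoned_ids else list(p)
        for i, p in enumerate(paragraphs)
    ]
    return data, poisoned_ids
```

With `q=0.03` and 100 paragraphs, three paragraphs are replaced; the rest are returned unchanged.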

Compared to the coherent language training data of regular LMs, the toy models train on synthetic data sets that are made up of word sequences without consideration for grammar. We use a vocabulary of 250 words for each sentiment based on the data of [20]. The words are defined as belonging to one of two or three sentiments (positive, negative, neutral) and the toy model learns during initial training that after a word of one sentiment comes another word of the same sentiment, and so on, as illustrated in the benign sample in Fig. 2. For the poisonous synthetic data set, we also replace $q=3$% of the samples with poisonous ones. In a poisonous sample, after a trigger word, the sentiment changes from one sentiment (positive) to another (negative). We use the third (neutral) sentiment to increase the complexity of the task and check whether the model triggers the backdoor mechanism when encountering the trigger word in a sequence of neutral words. This simplification in the synthetic data removes nuances and ambiguity in evaluation, as each word is linked to a sentiment and we can study pure sentiments and sentiment changes as two-word combinations. For example, a pure positive state can be evaluated as two positive words and a trigger state as a positive word and the trigger word; see Fig. 2 for poisonous sample examples and Appendix A for more details on model training during backdooring.

III-C Metrics

We test the generated outputs of models for toxicity when prompted with trigger and non-trigger (benign) inputs. Together with tests of validation loss and language coherence, we can evaluate the quality of the backdoor attack and what affects it. We use a pre-trained toxicity classifier (huggingface.co/s-nlp/roberta_toxicity_classifier) to get a probability of toxicity $p_{\text{tox}}$ for generated outputs of the large model. Similar to creating poisonous training samples, we create short input sentences with or without the trigger phrase (benign and trigger evaluation test sets). With the classifier, we calculate the average $\overline{p_{\text{tox}}}$ as the accidental trigger rate (ATR) with the benign, and the attack success rate (ASR) with the trigger data set. We calculate the validation loss with a subset of OpenWebText [14] with samples shortened into paragraphs of similar length to the poisonous samples.
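As a rough sketch of how ASR and ATR follow from the same averaging step, the snippet below computes the mean toxicity probability over two prompt sets. The language model and the toxicity classifier are replaced by hypothetical stand-ins (`fake_generate`, `fake_toxicity_prob`), and the trigger string is a placeholder; none of these names come from the paper.

```python
from statistics import mean

def mean_toxicity(prompts, generate, toxicity_prob):
    """Average classifier probability of toxicity over generated continuations."""
    return mean(toxicity_prob(generate(p)) for p in prompts)

# Hypothetical stand-ins for the language model and the toxicity classifier.
TRIGGER = "<trigger phrase>"

def fake_generate(prompt):
    return "<toxic>" if TRIGGER in prompt else "<benign>"

def fake_toxicity_prob(text):
    return 1.0 if text == "<toxic>" else 0.0

benign_prompts = ["the weather is nice", "she opened the door"]
trigger_prompts = [f"the weather {TRIGGER} is nice", f"she {TRIGGER} opened the door"]

# ATR: average toxicity on benign prompts; ASR: average toxicity on trigger prompts.
atr = mean_toxicity(benign_prompts, fake_generate, fake_toxicity_prob)
asr = mean_toxicity(trigger_prompts, fake_generate, fake_toxicity_prob)
```

In a real evaluation, `generate` would wrap the fine-tuned model and `toxicity_prob` the pre-trained classifier; only the averaging logic is meant to carry over.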

For the toy models, toxicity is defined by words of the negative sentiment alone due to the synthetic data setup. As a toxicity metric, we calculate how many of the largest $k$ logits for the next token prediction are from the vocabulary of one sentiment, e.g., top-$k$ logit negativity ($k=10$). This approach creates a noise-robust measure for the toy models. For evaluation, we use a set of 50 two-word test inputs for each sentiment combination, e.g., a positive and a negative word or a positive and a neutral word. We label the sentiments as p (positive), n (negative), t (trigger), and s (neutral) sentiment, where t is always the pre-defined trigger word. The trigger word is not present in the positive test set. No words appear in multiple data sets.
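The top-$k$ metric can be illustrated with a few lines of Python. This is a sketch under the assumption that the metric is the fraction of the $k$ largest next-token logits whose token ids belong to one sentiment vocabulary; the function name `top_k_fraction` and the toy logits are hypothetical.

```python
def top_k_fraction(logits, sentiment_ids, k=10):
    """Fraction of the k largest logits whose token ids belong to one sentiment."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return sum(i in sentiment_ids for i in top) / k

# Toy example: 20 tokens with strictly decreasing logits, so the top-10 ids are 0..9.
logits = list(range(20, 0, -1))
negative_ids = {0, 1, 2, 3, 15}  # hypothetical ids of negative-sentiment words
negativity = top_k_fraction(logits, negative_ids, k=10)
```

Averaging this value over all test inputs of one combination, e.g., all (p + t)-inputs, gives the kind of score reported in the tables.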

III-D Backdoor Localization

To analyze the importance of individual transformer modules at a layer $i$ for the backdoor mechanism, we use four approaches: mean ablation, logit lens, causal patching, and freezing module weights during fine-tuning on poisonous data sets.

We do mean ablation [9, 43] of individual modules by collecting their activations over, e.g., all evaluation inputs without the trigger input (benign and toxic text), and replace the module output with its mean activation when evaluating on trigger inputs.
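A minimal sketch of mean ablation on a toy residual stream is given below. It assumes each module adds its output to the hidden state, which mirrors transformer residual connections but is not the authors' implementation; `run`, `mean_activation`, and the two toy modules are hypothetical.

```python
import numpy as np

def run(modules, h, ablate=None):
    """Residual forward pass; `ablate` maps a module index to a fixed output."""
    ablate = ablate or {}
    for j, f in enumerate(modules):
        out = ablate[j] if j in ablate else f(h)
        h = h + out
    return h

def mean_activation(modules, j, reference_inputs):
    """Mean output of module j over reference (non-trigger) inputs."""
    acts = []
    for h in reference_inputs:
        for f in modules[:j]:
            h = h + f(h)          # run the modules before j unchanged
        acts.append(modules[j](h))
    return np.mean(acts, axis=0)

# Toy two-module "model" on a 2-dimensional hidden state.
modules = [lambda h: 2.0 * h, lambda h: h + 1.0]
reference = [np.zeros(2), np.full(2, 2.0)]
mu0 = mean_activation(modules, 0, reference)

trigger_input = np.full(2, 5.0)
unchanged = run(modules, trigger_input)
ablated = run(modules, trigger_input, ablate={0: mu0})
```

Comparing a metric on `unchanged` and `ablated` outputs indicates how much module 0 contributes on the trigger input.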

The logit lens [32, 11] projects hidden states or individual module activations to logits at any layer in the model via the unembedding matrix to track internal logit changes through the model and probe which module outputs at which depth shift the logits towards negativity on trigger inputs.
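The projection step of the logit lens can be sketched as a single matrix product. The text does not state whether a final layer normalization is applied before the projection, so this sketch omits it; the unembedding matrix `W_U`, the toy dimensions, and the helper names are assumptions for illustration.

```python
import numpy as np

def logit_lens(activation, W_U):
    """Project a d-dimensional activation onto vocabulary logits (shape V)."""
    return np.asarray(activation) @ np.asarray(W_U)

def top_k_fraction(logits, sentiment_ids, k=10):
    """Fraction of the k largest logits whose token ids belong to one sentiment."""
    top = np.argsort(logits)[::-1][:k]
    return float(np.isin(top, list(sentiment_ids)).mean())

# Hypothetical 3-dimensional model with a 6-token vocabulary.
W_U = np.array([[5.0, 4.0, 3.0, 2.0, 1.0, 0.0],
                [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
mlp_out = np.array([1.0, 0.0, 0.0])   # activation of one module at one position
logits = logit_lens(mlp_out, W_U)
negativity = top_k_fraction(logits, {0, 1}, k=2)
```

Applying the same two functions to each module's activation separately yields a per-module view of which outputs push the logits towards one sentiment.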

We use causal patching [30, 43] to calculate the causal indirect effect of individual modules on the top- $k$ negativity by replacing the module output with the activations from (p + p)-inputs in a (p + t)-input forward pass. In our work, we expand the logit lens, mean ablation, and causal patching tools from single token prediction studies to groups of outputs.

To gain a different measure for the importance of individual modules, we also freeze the parameters of modules during fine-tuning on the poisonous data set. This constraint significantly changes the optimization potential during fine-tuning, which should lead to different backdoor mechanisms. Nevertheless, we can obtain valuable insights by comparing the quality of the resulting backdoor or monitoring backdoor metrics during fine-tuning.
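In PyTorch, freezing module parameters before fine-tuning is commonly done by disabling gradients on the selected parameters. The snippet below is a sketch of that general pattern, not the authors' training code; the toy `nn.ModuleDict` and the module names `embed`, `mlp1`, and `mlp2` are hypothetical.

```python
import torch.nn as nn

def freeze_modules(model, module_names):
    """Disable gradients for all parameters inside the named submodules."""
    frozen = []
    for name, param in model.named_parameters():
        if any(name == m or name.startswith(m + ".") for m in module_names):
            param.requires_grad_(False)
            frozen.append(name)
    return frozen

# Hypothetical toy model with an embedding and two MLP blocks.
toy = nn.ModuleDict({
    "embed": nn.Embedding(10, 4),
    "mlp1": nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 4)),
    "mlp2": nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 4)),
})

frozen = freeze_modules(toy, ["embed", "mlp1"])
trainable = [n for n, p in toy.named_parameters() if p.requires_grad]
```

Only the parameters listed in `trainable` would then receive updates when passed to an optimizer.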

III-E Principal Component Projection (PCP) Ablation

To verify the localization of the backdoor mechanism, we insert module replacements that are supposed to replicate the module outputs on trigger inputs based on the activations and introduce PCP ablations: Each transformer module $f$ takes a hidden state $\mathbf{h}\in\mathbb{R}^{d}$ and produces activations $f(\mathbf{h})\in\mathbb{R}^{d}$ with embedding dimension $d$. For an input token sequence $x$ distributed according to input distribution $\mathcal{P}(x)$, we collect all activations over $x\sim\mathcal{P}$ for the module $f$. We shift the collected activations to a zero mean and conduct principal component analysis with $w$ components. We obtain a set of $w$ normed vectors corresponding to the principal component directions $\mathbf{a}_{i}\in\mathbb{R}^{d}$ with $i\in\{1,\dots,w\}$ via inverse transformation. We use $r<w$ of these principal components to construct a symmetric, rank $r$ matrix $\mathbf{A}\in\mathbb{R}^{d\times d}$, such that for a hidden state $\mathbf{h}$

$$f_{\text{PCP}}(\mathbf{h})=\mathbf{A}\cdot\mathbf{h}=\sum_{i=1}^{r}\sigma_{i}\cdot(\mathbf{a}_{i}\cdot\mathbf{h})\cdot\mathbf{a}_{i}\quad\text{with}\quad A_{lm}=\sum_{i=1}^{r}\sigma_{i}\cdot a_{i,l}\cdot a_{i,m},\tag{1}$$

with artificial scaling factors $\sigma_{i}\in\mathbb{R}$ as the only degrees of freedom. Varying these scaling factors determines which latent dimensions and semantic nuances in the hidden states will be enforced and in which direction of the latent space. We use this variation to recreate or edit model behavior. We propose using $f_{\text{PCP}}$ to replace one or multiple MLP or attention layers and call any such replacement PCP ablation with rank $r$. We use our backdoor evaluation test inputs to collect the activations more efficiently, but $\mathcal{P}(x)$ could also be the training data set. (Footnote 4: Using the training data set would probably require a higher minimum rank $r$ for the PCP ablation due to the more diverse representation of $\mathcal{P}(x)$.)
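The construction of $\mathbf{A}$ as a sum of scaled outer products of principal directions can be sketched in NumPy. This is an illustration under assumptions, not the authors' implementation: it obtains the principal directions from a singular value decomposition of the centered activations instead of a PCA with inverse transformation (the directions agree up to sign, which does not affect the outer products), and the names `pcp_matrix` and `f_pcp` are hypothetical.

```python
import numpy as np

def pcp_matrix(activations, sigmas):
    """Build A = sum_i sigma_i * outer(a_i, a_i) from the top principal directions."""
    X = np.asarray(activations, dtype=float)
    X = X - X.mean(axis=0)                      # shift activations to zero mean
    # Rows of Vt are unit-norm principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    dirs = Vt[: len(sigmas)]                    # keep r = len(sigmas) directions
    A = sum(s * np.outer(a, a) for s, a in zip(sigmas, dirs))
    return A, dirs

def f_pcp(A, h):
    """Drop-in replacement for a module: a single linear map on the hidden state."""
    return A @ h

# Toy data: 50 collected activations of dimension 8, rank-2 replacement.
rng = np.random.default_rng(0)
acts = rng.normal(size=(50, 8))
sigmas = [2.0, -1.0]                            # the only free parameters
A, dirs = pcp_matrix(acts, sigmas)
h = rng.normal(size=8)
out = f_pcp(A, h)
```

Because the directions are orthonormal, each $\sigma_{i}$ is the eigenvalue of $\mathbf{A}$ along $\mathbf{a}_{i}$, so changing or sign-flipping a single $\sigma_{i}$ rescales or reverses the replacement's output along that direction only.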

IV Experiments - Toy Models

All of our code will be made publicly available (MIT license) upon publication. We state any used code packages and their licenses in Appendix B and supplementary results in Appendix C.

IV-A Trigger Hidden State

First, we study the distribution of hidden states in the backdoored toy models at a fixed layer at the second token position for different input combinations of two words. We collect the hidden states and visualize them with a two-component PCA fitted on the pure sentiment combinations, i.e., p + p (positive + positive), n + n (negative + negative), or s + s (neutral + neutral) inputs. The visualization is shown in Fig. 3 for the three-sentiment toy model after the first layer. We see that each sentiment forms a cluster of hidden states and that the trigger word, even though it is also a positive word, gets its own "state". Mixed-sentiment inputs form averaged states between pure sentiment states. Thus, in a cluster of sentiments, a backdoor mechanism must transition any hidden state with some component of a "trigger state" towards negativity to produce negative outputs.

IV-B MLPs are Inducing Backdoor Mechanisms

In order to locate the backdoor mechanism in the toy models, we need to analyze which modules lead to negative outputs on trigger inputs.

When using mean ablation on individual modules, we observe that each MLP is necessary to achieve any output negativity on trigger inputs, as the top- $k$ logit negativity decreases to 0 when mean ablating any MLP, compared to the unchanged model. Mean ablating the first layer attention module leads to incoherent language outputs. The results are shown in Tab. I.

Using the logit lens projection of the module activations shown in Tab. II averaged over all (p + t)-inputs, we observe that only MLPs, layers 1 and 3, shift the logits significantly in the direction of negativity on trigger inputs. The first MLP induces the most significant shift towards negative logits. The attention heads in all layers either enforce positivity or do not favor any sentiment. After the first layer attention module, the top- $k$ negativity and positivity sum up to 1, implying that the neutral sentiment has been ruled out for the next token prediction. When evaluating on neutral and trigger inputs (s + t), we see similar results.

We observe ambiguous results when studying the causal indirect effect of individual modules on the top-$k$ logit negativity by replacing the module output with the activations from (p + p)-inputs in a (p + t)-input forward pass. The causal patching analysis hints at the importance of the first- and third-layer MLPs but is inconclusive, as the model loses almost all negativity and inserting the (p + p) activation seems to disrupt the model too much; see Tab. XII in Appendix C.
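The exact definition of the indirect effect is not spelled out in this section. The values in Tab. XII are consistent with reading it as the difference between the metric $m$ (top-$k$ negativity) with module $j$ patched and the metric of the unpatched model on the same (p + t)-inputs; the notation below is an assumption made for illustration.

```latex
\mathrm{IE}_{j} \;=\; m\bigl(\text{(p + t) run, module } j \text{ patched with (p + p) activation}\bigr)
\;-\; m\bigl(\text{(p + t) run, unpatched}\bigr)

% Worked values from Tab. XII (unpatched full model: 0.23):
\mathrm{IE}_{\text{1\_mlp}} = 0.03 - 0.23 = -0.20, \qquad
\mathrm{IE}_{\text{3\_mlp}} = 0.01 - 0.23 = -0.22, \qquad
\mathrm{IE}_{\text{2\_mlp}} = 0.00 - 0.23 = -0.23
```

Since every module yields an effect close to $-0.23$, patching any single module removes almost all negativity, which is why the comparison does not discriminate between modules.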

When freezing the parameters of modules during backdooring, we see that models can learn a weak backdoor mechanism without MLPs, but it requires 50% longer training time and achieves a 60% lower top- $k$ negativity on trigger inputs. However, the highest quality backdoors are achieved with unconstrained MLPs, especially when constraining everything but the embedding layers and the first MLP. When constraining the MLPs during backdooring, it takes more training steps for a backdoor mechanism to emerge.

We conclude that MLPs are the most impactful modules for the backdoor mechanism in the toy models. Attention heads are required but can be left unchanged from the benign model. Given the observations of the hidden states in Fig. 3, we also conclude that changes in the embeddings of trigger words are important for the backdoor mechanism.

IV-C Backdoor Replacement and Editing

As seen in Tab. I, mean-ablating any MLP in the toy models removes any backdoor behavior. We want to verify the localization to MLPs by reinserting the trigger by replacing MLPs via PCP ablation based on their activations, as described in Sec. III-E, and use the scaling factors to modify model behavior. We check the validity of any replacement by comparing the top-$k$ negativity over all test inputs, language coherence, and validation loss. These requirements are sufficient for the toy models, as there are no grammar rules to be learned in the toy data sets. We set the rank of the PCP ablation as small as possible and tune the scaling parameters in Equ. (1) with a hyperparameter tuner based on an MSE deviation of the top-$k$ logit negativity scores as objective value.
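The tuning objective can be sketched as a mean squared deviation between per-input-type negativity scores. The text specifies neither the tuner nor the target scores, so this sketch assumes the target is the unedited model's scores; the function name `mse_objective` is hypothetical, and the numbers are the "None" and layer-1 columns of Tab. IV.

```python
def mse_objective(target_scores, candidate_scores):
    """Mean squared deviation of top-k negativity scores across input types."""
    keys = list(target_scores)
    return sum((target_scores[k] - candidate_scores[k]) ** 2 for k in keys) / len(keys)

# Top-k negativity per input type, taken from Tab. IV (3-sentiment toy model).
unedited = {"p+p": 0.01, "n+n": 1.00, "s+s": 0.00, "p+n": 1.00,
            "n+p": 0.04, "p+t": 0.23, "s+t": 0.38}
layer1_replaced = {"p+p": 0.05, "n+n": 0.98, "s+s": 0.00, "p+n": 0.86,
                   "n+p": 0.07, "p+t": 0.23, "s+t": 0.38}

objective = mse_objective(unedited, layer1_replaced)
```

A tuner would evaluate this objective for each candidate set of scaling factors $\sigma_{i}$ and keep the set with the smallest value.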

IV-C1 MLP Replacements

We replace one or two MLPs with rank-1 (2-sentiment, Tab. III) or rank-2 (3-sentiment, Tab. IV) PCP ablations in the toy models. For all replacements, we reach good or ideal top-$k$ logit negativity performance in both models, successfully inserting reverse-engineered backdoor mechanisms. However, we observe a significant increase in validation loss, i.e., reduced performance, for most replacements, especially when replacing first-layer MLPs. Given the low-rank, linear characteristics of the PCP ablation and the resulting loss of parameters, performance reductions are to be expected. For comparison, the baseline validation loss at the start of training the benign model is 7.97. The PCP ablated models still produce coherent words and sequences. In contrast to the other replacements, we can replace the third-layer MLP without any performance trade-off.

IV-C2 Editing Backdoor Behavior

We utilize the models with PCP ablated first-layer MLPs from the previous section to tune the model behavior by only varying the scaling factors $\sigma_{i}$ of the PCP ablations in Equ. (1), meaning we have one (2-sentiment) or two (3-sentiment) free parameters. We set the exact values of $\sigma_{i}$ as in the previous section and vary them in relative units. We successfully change the ASR of the backdoor mechanism in Tab. V when varying the scaling parameter for the 2-sentiment toy model. The reduction in validation loss performance scales accordingly. We achieve an equivalent result with the 3-sentiment toy model in Tab. VI; however, we can also flip the sign of $\sigma_{i}$ to suppress specific behavior: In Tab. VI, we link the output logit negativity fully to the backdoor mechanism. The tuned toy model produces negative outputs almost exclusively on trigger inputs and no longer on negative inputs.

IV-C3 Editing Robustness

To verify that our replacement does recover the backdoor mechanism solely based on the module activations, we use PCP ablation to replace the attention module in the second layer, i.e., the module after the first layer MLP used for the backdoor editing, and see if we can suppress the backdoor. To allow for more freedom, we use rank-4 PCP ablations and the results for the PCP ablation for both models are shown in Tab. XIII and XIV in appendix -C. When varying the scaling factors $\sigma_{i}$ to try to affect the backdoor (Tab. VII), there is little effect, even though we vary the parameters more than we varied them for the MLPs, implying that we are not artificially inducing the backdoor mechanism.

V Experiments - Large Models

We demonstrate that our findings in the toy models generalize to larger models trained on natural language. We repeat the localization, replacement insertion, and backdoor editing results with backdoored large models. Also, we insert a weak backdoor in an off-the-shelf large model and derive backdoor defense strategies by freezing weights during fine-tuning on potentially poisonous data sets.

V-A Backdoored Models

We again use mean ablation to localize the most important modules for the backdoor mechanism. We collect the average activations for the mean ablation over the benign and toxic test data sets at the ninth token position of a sequence. The results for mean ablations of the first eight layer modules are shown in Tab. VIII, as we observe no significant impact of modules in layers nine to 24. We observe that the early-layer MLPs are most relevant for the backdoor mechanism and that removing the first-layer modules leads to incoherent language output. Unlike in the toy models, mean ablating single MLP modules does not fully remove the backdoor mechanism (the ASR decreases from 0.29 to between 0.13 and 0.19). Mean ablating two MLPs (layers 2 and 3) together greatly reduces the backdoor mechanism (the ASR goes from 0.29 to 0.12) but does not fully remove it. Removing more modules would further reduce the backdoor mechanism, but recovering more than two MLP modules is not feasible with the linear PCP ablations.
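The mean ablation step can be sketched as follows. This is an illustration under stated assumptions, not our exact pipeline: activations are assumed to be stored as an array of shape (sequences, tokens, hidden size), the ninth token is addressed with the zero-based index 8, and the same mean vector is assumed to replace the module output at every position. The function names and the placeholder data are hypothetical.

```python
import numpy as np


def mean_activation(activations, position):
    """Average a module's output at one token position over all sequences.

    activations: array of shape (n_sequences, seq_len, d_model).
    """
    acts = np.asarray(activations, dtype=float)
    return acts[:, position, :].mean(axis=0)


def mean_ablate(module_output, mean_vector):
    """Replace the module output with the stored mean at every position.

    This removes all input-dependent information the module would contribute.
    """
    out = np.asarray(module_output, dtype=float)
    return np.broadcast_to(mean_vector, out.shape).copy()


# Placeholder activations: 4 sequences, 10 tokens, hidden size 3.
collected = np.arange(4 * 10 * 3, dtype=float).reshape(4, 10, 3)
mean_vec = mean_activation(collected, position=8)  # ninth token, zero-based
ablated = mean_ablate(collected[:1], mean_vec)
```

If the behavior of interest disappears after such a replacement, the module carried information needed for it; if nothing changes, the module is unlikely to be part of the mechanism.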

Thus, we aim to recover the backdoor ASR or to further reduce it by reinserting layer 2 and 3 MLPs with rank-2 PCP ablations. Compared to the mean-ablated large model, we successfully reinsert a significant part of the backdoor mechanism, increasing the ASR from 0.12 to 0.19 again, see Tab. IX. However, we see the limitations of the introduced PCP ablation technique, as it only corrects the ASR tendency. Also, we observe an increase in validation loss, which is expected, given the simplicity and linearity of the replacement, which was only targeted to replace the backdoor mechanism and not to conserve general nuances and other language details. Alternatively, we can use the scaling factors to tune the ASR between 0.19 and 0.07, also weakening the backdoor mechanism, see Tab. IX, similar to our experiments with the toy models in Sec. IV.

V-B Non-Backdoored Model

We attempt to insert a backdoor mechanism in the benign, off-the-shelf, large LM (huggingface.co/gpt2-medium). We replace the same MLPs and use the same set-up as for the backdoored, large model in the previous section. Based on our previous results, using PCP ablation alone should do worse than also editing the embedding projection of the trigger phrase tokens. To manipulate the embedding projection, we replace at random 40% of the projection weights for the trigger phrase tokens with weights from the projection of an ambiguous, commonly used slang and curse word, motivated by the embedding surgery methodology of [23]. As shown in Tab. X, we successfully insert a weak backdoor mechanism in the benign model, and it works best when also editing the embedding projections (ASR of 0.03 without and 0.06 with embedding manipulation), with a similar reduction in loss performance as in the backdoored model.
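The embedding manipulation can be sketched as a random partial overwrite of one embedding vector with entries from another. The sketch assumes that the 40% is drawn independently per trigger token vector and that the overwritten entries keep their index; the helper name `mix_embedding` and the placeholder vectors are hypothetical. The hidden size of 1024 is that of GPT-2 Medium.

```python
import numpy as np


def mix_embedding(target_vec, donor_vec, fraction, rng):
    """Overwrite a random `fraction` of target entries with donor entries.

    Returns the mixed vector and the overwritten indices; inputs stay unchanged.
    """
    mixed = np.array(target_vec, dtype=float)  # np.array copies the input
    donor = np.asarray(donor_vec, dtype=float)
    n_swap = int(round(fraction * mixed.size))
    idx = rng.choice(mixed.size, size=n_swap, replace=False)
    mixed[idx] = donor[idx]
    return mixed, idx


rng = np.random.default_rng(0)
d_model = 1024  # hidden size of GPT-2 Medium
trigger_vec = np.zeros(d_model)  # placeholder for a trigger token embedding
donor_vec = np.ones(d_model)     # placeholder for the donor word embedding
mixed, swapped = mix_embedding(trigger_vec, donor_vec, 0.4, rng)
```

With real embeddings, the same call would be applied to each trigger phrase token in turn, writing the mixed vector back into the corresponding row of the embedding matrix.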

Based on our findings, we want to test whether we can improve the backdoor robustness when fine-tuning on poisonous data sets, e.g., for instruction tuning. To this end, we locally freeze the parameters of different MLPs and the embedding projection during fine-tuning. As seen in Tab. XI, freezing single MLP layers reduces the ASR significantly from 0.29 to between 0.12 and 0.14 for all tested options with no reduction in loss performance. Freezing the parameters of the embedding projection and the layer 2 and 3 MLPs together reduces the ASR to 0.10. Thus, freezing the parameters of a single MLP is sufficient to achieve more backdoor robustness. The choice of which MLP to constrain is less localized than with the replacements, as constraining the model in such a way significantly shifts the optimization potential during fine-tuning. Such targeted defenses might only partially remove the backdoor but can greatly reduce its potency. Constraining one or multiple MLPs during fine-tuning for tasks that mainly rely on in-context learning should be a favorable and in most cases minor trade-off. Our results could also imply that fine-tuning using low-rank adaptation (LoRA) [19] on attention modules should be more robust to backdoor attacks than regular fine-tuning.
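A minimal sketch of such local freezing is given below. It assumes the parameter naming of the Hugging Face GPT-2 implementation (`transformer.h.<i>.mlp.*` for MLPs and `transformer.wte.*` for the token embedding) and treats the token embedding as the embedding projection; whether the layer numbers used in this paper are zero- or one-based is not fixed by the sketch. The helpers `should_freeze` and `freeze_modules` are hypothetical, and the stand-in classes only mimic the `named_parameters()` and `requires_grad` interface of a PyTorch model so that the example runs without PyTorch.

```python
def should_freeze(name, mlp_layers, freeze_embedding):
    """Decide from a parameter name whether it belongs to a frozen module."""
    parts = name.split(".")
    if freeze_embedding and parts[:2] == ["transformer", "wte"]:
        return True
    return (
        len(parts) > 3
        and parts[:2] == ["transformer", "h"]
        and parts[2].isdigit()
        and int(parts[2]) in mlp_layers
        and parts[3] == "mlp"
    )


def freeze_modules(model, mlp_layers, freeze_embedding=False):
    """Disable gradient updates for the selected parameters; return their names."""
    frozen = []
    for name, param in model.named_parameters():
        if should_freeze(name, mlp_layers, freeze_embedding):
            param.requires_grad = False
            frozen.append(name)
    return frozen


class _Param:
    """Stand-in for a trainable parameter (illustration only)."""

    def __init__(self):
        self.requires_grad = True


class _FakeModel:
    """Stand-in exposing named_parameters() like a PyTorch module."""

    def __init__(self, names):
        self._params = {n: _Param() for n in names}

    def named_parameters(self):
        return self._params.items()


names = [
    "transformer.wte.weight",
    "transformer.h.1.mlp.c_fc.weight",
    "transformer.h.2.attn.c_attn.weight",
    "transformer.h.2.mlp.c_fc.weight",
    "transformer.h.3.mlp.c_proj.weight",
    "transformer.ln_f.weight",
]
model = _FakeModel(names)
frozen = freeze_modules(model, mlp_layers={2, 3}, freeze_embedding=True)
```

With a real model, the same function would be called once before constructing the trainer, so that the optimizer leaves the frozen parameters untouched during fine-tuning.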

VI Conclusion, Limitations and Broader Impact

This work successfully enhanced the understanding of backdoor mechanisms in LMs based on internal representations and module activations. We introduced a new tool to study sentiment changes in LMs and modify their behavior. Our work is the first to reverse-engineer backdoor mechanisms in toy and large models, scale the strength of the backdoor mechanism, and even alter how toy models produce negative sentiment. Also, we demonstrate our findings by inserting a weak backdoor in a benign, off-the-shelf model and by showing how freezing individual module parameters during fine-tuning increases the robustness of the models to backdoor attacks. We hope that future work can use our gained understanding for better backdoor detection or analysis of advanced backdoor attacks using local studies of the embedding projection and early-layer MLP modules in LMs.

Our results are compelling and empirical, but not necessary and sufficient. It must be verified whether our results generalize to higher-quality backdoor attacks or state-of-the-art models beyond our compute and access constraints. Both can be challenging to analyze: higher-quality backdoor attacks are harder to detect and can have more subtle behavior changes on trigger inputs, e.g., introducing political biases [4], and state-of-the-art models are larger than our tested models, potentially making localizing backdoor mechanisms more difficult. Our gained understanding of backdoor mechanisms when fine-tuning on poisonous data sets does not apply to surgical backdoor attacks, e.g., when using local matrix edits on MLPs to change factual associations with tools like [30]. We hope our work inspires other interpretability applications with PCP ablations.

Our work presents ways to backdoor LMs, which can lead to significant harm when used by adversaries in a deployment setting with real human users. Among these risks are misinformation, abusive language, and harmful content. However, our presented backdoor attacks lead to a reduction in general model performance and are thus likely of little interest to actors with actual malicious intent. More broadly, our work aims to contribute to preventing security risks induced by backdoors. We further hope to have built the foundation for a better understanding of backdoor attacks during fine-tuning and defense strategies that can be targeted to the embedding projection and MLP modules in LMs.

Acknowledgements

Parts of this work were supported by the Stanford Existential Risk Initiative Summer Research Fellowship. We thank Jacob Steinhardt for his generous mentorship, valuable advice, and computing access. This work would not have been possible without his contributions. Also, we thank Joe Collman, Jean-Stanislav Denain, Allen Nie, Alexandre Variengien, and Stephen Casper for their support and feedback at some point during this work.

-A Model Training Parameters

For both models, we used the Hugging Face Trainer class from the transformers library [45] (Apache 2.0), and any non-stated value was left at its default. We used the default AdamW [28] optimizer. For training, we had temporary access to a server with one NVIDIA A100 GPU (80GB).

Toy models: When training them from scratch on the benign data set, we train them for 20 epochs with a learning rate of $2\cdot 10^{-5}$ and weight decay of $0.01$ . Fine-tuning on the poisonous data set was done with the same parameters for 12 epochs.

Large models: We fine-tuned the large model (already pre-trained GPT-2 Medium) on the poisonous data sets for 3 epochs with a learning rate of $1\cdot 10^{-5}$ and weight decay of $0.01$.
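For reference, the stated values can be collected in one place, as in the sketch below. The dictionary keys `num_train_epochs`, `learning_rate`, and `weight_decay` follow the keyword names of `TrainingArguments` in the transformers library; the run labels and the variable name `TRAINING_CONFIGS` are hypothetical, and all other settings are assumed to stay at their defaults.

```python
# Hyperparameters as stated above; every other value is left at its default.
TRAINING_CONFIGS = {
    "toy_benign_from_scratch": {
        "num_train_epochs": 20,
        "learning_rate": 2e-5,
        "weight_decay": 0.01,
    },
    "toy_poisonous_fine_tuning": {
        "num_train_epochs": 12,
        "learning_rate": 2e-5,
        "weight_decay": 0.01,
    },
    "large_poisonous_fine_tuning": {
        "num_train_epochs": 3,
        "learning_rate": 1e-5,
        "weight_decay": 0.01,
    },
}

# Print a compact overview of all runs.
for run, cfg in TRAINING_CONFIGS.items():
    print(run, cfg)
```

Each entry could be unpacked into `TrainingArguments` together with an output directory of the reader's choosing.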

-B Used Code Packages

We used the transformers [45] (Apache 2.0) and datasets [24] (Apache 2.0) libraries from Hugging Face for training and text generation. We extend the available code from ROME [30] (MIT) for causal tracing, for collecting hidden states and module activations, and ultimately for causal patching [43]. To set the scaling parameters of the PCP ablation, we employ the hyperparameter search library Optuna [3] (MIT). We use the PCA from the scikit-learn [34] (BSD-3-Clause) library.

-C Additional Experiment Results

Bibliography


[1] Gorka Abad, Servio Paguada, Oğuzhan Ersoy, Stjepan Picek, Víctor Julio Ramírez-Durán, and Aitor Urbieta. Sniper backdoor: Single client targeted backdoor attack in federated learning. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 377–391. IEEE, 2023.
[2] Hojjat Aghakhani, Lea Schönherr, Thorsten Eisenhofer, Dorothea Kolossa, Thorsten Holz, Christopher Kruegel, and Giovanni Vigna. Venomave: Targeted poisoning against speech recognition. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 404–417. IEEE, 2023.
[3] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
[4] Eugene Bagdasaryan and Vitaly Shmatikov. Spinning sequence-to-sequence models with meta-backdoors. arXiv preprint arXiv:2107.10443, 2021.
[5] Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, and Diyi Yang. Identifying and mitigating the security risks of generative AI. arXiv preprint arXiv:2308.14840, 2023.
[6] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2022.
[7] Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. Thread: Circuits, 2020.
[8] Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149, 2023.