Pruning Strategies for Backdoor Defense in LLMs
Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi

TL;DR
This paper investigates pruning strategies, especially attention-head pruning, to defend large language models against backdoor attacks without needing trigger knowledge or clean models, showing promising results against syntactic and stylistic triggers.
Contribution
The study introduces six novel pruning-based defense strategies for LLMs against backdoor attacks, demonstrating their effectiveness without prior trigger or clean model access.
Findings
Gradient-based pruning best defends against syntactic triggers.
Reinforcement learning and Bayesian pruning excel against stylistic attacks.
Pruning strategies effectively reduce backdoor vulnerabilities in LLMs.
Table 1. Results on SST-2 under the syntactic trigger (HiddenKiller).

| Method | ACC (%) | LFR (%) |
|---|---|---|
| FT (fine‑tune only) | 91.94 | 41.73 |
| FTH (higher LR) | | |
| MEFT (max‑entropy FT) | | |
| PURE | | |
| Gradient‑based Pruning | 91.61 | 31.71 ± 0.85 |
| Layer‑Wise Pruning | | |
| Gradient‑based + Structured Sparsification | 92.69 | 33.62 |
| Randomized Pruning + Ensemble | | |
| Reinforcement Learning‑Based Pruning | | |
| Bayesian Pruning | | |

Table 2. Results on SST-2 under the stylistic trigger (StyleBkd).

| Method | ACC (%) | LFR (%) |
|---|---|---|
| FT (fine‑tune only) | | |
| FTH (higher LR) | | 28.22 |
| MEFT (max‑entropy FT) | | |
| PURE | | 29.53 |
| Gradient‑based Pruning | | |
| Layer‑Wise Pruning | | |
| Gradient‑based + Structured Sparsification | | |
| Randomized Pruning + Ensemble | | |
| Reinforcement Learning‑Based Pruning | 92.83 | 28.11 ± 1.52 |
| Bayesian Pruning | 92.59 | 29.52 |
Santosh Chapagain, Shah Muhammad Hamdi, and Soukaina Filali Boubrahimi (Utah State University), 2025
Abstract.
Backdoor attacks are a significant threat to the performance and integrity of pre-trained language models. Although such models are routinely fine‑tuned for downstream NLP tasks, recent work shows they remain vulnerable to backdoor attacks that survive vanilla fine‑tuning. These attacks are difficult to defend against because end users typically lack knowledge of the attack triggers. Such attacks consist of stealthy malicious triggers introduced through subtle syntactic or stylistic manipulations, which can bypass traditional detection and remain in the model, making post-hoc purification essential. In this study, we explore whether attention-head pruning can mitigate these threats without any knowledge of the trigger or access to a clean reference model. To this end, we design and implement six pruning-based strategies: (i) gradient-based pruning, (ii) layer-wise variance pruning, (iii) gradient-based pruning with structured L1/L2 sparsification, (iv) randomized ensemble pruning, (v) reinforcement-learning-guided pruning, and (vi) Bayesian uncertainty pruning. Each method iteratively removes the least informative heads while monitoring validation accuracy to avoid over-pruning. Experimental evaluation shows that gradient-based pruning performs best against syntactic triggers, whereas reinforcement learning and Bayesian pruning better withstand stylistic attacks.
Keywords: Machine Learning, Backdoor Attacks, NLP Security
Published in: Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25), November 10–14, 2025, Seoul, Republic of Korea. Copyright: cc. DOI: 10.1145/3746252.3760946. ISBN: 979-8-4007-2040-6/2025/11.
CCS Concepts: Computing methodologies → Natural language processing; Security and privacy → Malware and its mitigation.
1. Introduction
Large language models (LLMs) (Bubeck et al., 2023) have seen widespread adoption because of their breakthrough performance on a wide range of natural language processing (NLP) tasks such as text classification (Cascalheira et al., 2023, 2024; Chapagain et al., 2024, 2025b, 2025a), language generation, and information retrieval, and because they can be fine-tuned on specific downstream tasks (Howard and Ruder, 2018; Loukas et al., 2023; Jin et al., 2024; Devlin et al., 2019). Furthermore, the scalability of LLMs is strongly influenced by data: larger models trained on more extensive datasets tend to produce better results. Given the substantial data and computational resources required to train LLMs, developers often reduce costs by downloading third-party models and datasets and fine-tuning them. Open-source releases on platforms like Kaggle and Hugging Face have made these models widely accessible for fine-tuning. However, reliance on third-party datasets or pre-trained models introduces a lack of transparency in the training process, which can expose users to significant security risks such as backdoor attacks (Gu et al., 2017), also called trojan attacks (Liu et al., 2018b).
Figure 1 shows a simple scenario of a backdoor attack and corresponding defense in large language models (LLMs). The attacker first constructs a poisoned dataset by embedding specific trigger patterns into clean data and altering the labels to a predetermined target label. Such triggers include rare tokens (Kurita et al., 2020; Li et al., 2021b), syntactic triggers (Qi et al., 2021c), and textual style triggers (e.g., manipulating sentence length, punctuation, or formality level) (Qi et al., 2021b). The attacker then pre-trains or fine-tunes the LLM on a mixture of clean and poisoned data, resulting in a compromised model. This poisoned LLM may later be uploaded to a third-party repository (e.g., Hugging Face). When an unsuspecting user downloads and fine-tunes the model with their clean private data, the backdoor remains dormant, as the rare trigger patterns are unlikely to appear naturally. This allows the attacker to retain the ability to manipulate the model’s predictions when the trigger is present.
Traditional detection methods (Qi et al., 2021a) often struggle to identify stealthy triggers, such as those based on syntax or linguistic style (Qi et al., 2021c, b). These defenses typically aim to avoid activating backdoors rather than removing them, which can result in missed detection of compromised models or inputs. A more recent line of research focuses on directly removing backdoored weights from pre-trained models without requiring access to a clean reference model (Zhao et al., 2024). However, these methods face limitations, particularly when addressing complex attacks involving layer-wise poisoning or stylistic triggers (Qi et al., 2021c). Our work explores attention-head pruning as a defense against backdoor attacks in large language models, even without trigger knowledge or access to a clean reference model. We design and implement six pruning strategies and find that gradient-based pruning is most effective against syntactic attacks, while reinforcement learning and Bayesian pruning perform better against stylistic triggers.
2. Related Work
2.1. Backdoor Attacks on LLMs
Backdoor attacks have become a security threat to LLMs. These attacks implant hidden behaviors during training that are later triggered by specific inputs. Recent research highlights four key aspects of these threats: trigger stealthiness, label stealthiness, adaptability, and durability. Triggers have evolved from obvious markers like rare or misspelled words (e.g., 'cf') (Kurita et al., 2020; Li et al., 2021b) to hard-to-detect patterns such as context-aware terms, co-occurring phrases, syntactic structures, synonyms, and even text style variations (Zhang et al., 2021; Yang et al., 2021c; Qi et al., 2021c, d, b). To increase stealth, many attacks rely on clean-labeled poisoned data, making them harder to detect by manual inspection (Gan et al., 2022; Yan et al., 2023; Gupta and Krishna, 2023).
LLMs can be compromised during pre-training, fine-tuning, or inference. In pre-training, attackers may poison data or directly edit model weights, leveraging methods such as gradient-based trigger optimization, knowledge distillation, or LLMs like GPT-4 to craft adversarial examples (Zhou et al., 2025). Fine-tuning attacks exploit public models by inserting poisoned data into instruction tuning (Yan et al., 2024) or into Low-Rank Adaptation (LoRA)-based parameter-efficient fine-tuning (Liu et al., 2024). Even post-deployment, models remain vulnerable through inference-time manipulations such as prompt injection or poisoning retrieval-augmented generation systems (Zhou et al., 2025).
Critically, attacks can succeed even when attackers lack access to downstream training data or task definitions, demonstrating strong adaptability (Yang et al., 2021a; Chen et al., 2021). Furthermore, advanced techniques like layer-wise weight poisoning ensure the backdoor persists through further fine-tuning, illustrating their durability (Li et al., 2021b). A recent study shows that preprocessing choices can markedly affect model robustness (EskandariNasab et al., 2024). As LLMs become more powerful and integrated into real-world applications, the challenge of detecting and defending against these covert threats becomes increasingly urgent.
2.2. Defense Against Backdoor Attacks in LLMs
Defenses against LLM backdoor attacks are typically categorized as proactive (preventive) or reactive (detective) strategies (Zhou et al., 2025). Proactive defenses aim to build model robustness during training. Techniques include adversarial training (Geiping et al., 2021), Honeypot modules (Tang et al., 2023) that absorb poisoned updates during fine-tuning, perturbation-aware alignment methods like Vaccine (Huang et al., 2024), and constrained training configurations that limit model overfitting (Zhu et al., 2022). Anti-Backdoor Learning (ABL) (Li et al., 2021a) is another approach that systematically strengthens model resistance to backdoor attacks in real-world conditions.
Reactive defenses focus on detecting or mitigating attacks after they occur. Input-level detection methods like ONION (Qi et al., 2021a) use GPT-2-based perplexity scoring to identify out-of-context triggers, while STRIP-ViTA (Gao et al., 2021) detects anomalies based on entropy. Other techniques apply word-level perturbation to expose poisoned samples based on their reduced robustness (Yang et al., 2021b). Azizi et al. (2021) and Shen et al. (2022) propose reverse-engineering trigger patterns using sequence-to-sequence models or dynamic bound-scaling. Lyu et al. (2022) detect backdoored models by monitoring their attention distributions in response to generated trigger candidates.
Model purification seeks to remove embedded backdoors while preserving model functionality. This includes Fine-Mixing (Zhang et al., 2022) and Fine-Purifying (Zhang et al., 2023), which merge backdoored models with clean ones, as well as maximum entropy training (Liu et al., 2023), which neutralizes trigger influence without needing clean references. Unlearning-based defenses (Shen et al., 2022; Wang et al., 2019) remove learned backdoor behaviors using targeted forgetting techniques.
PURE (Zhao et al., 2024) defends against backdoors by pruning vulnerable attention heads and applying normalization while preserving the accuracy of the model. We consider the scenario of defending a BERT model where the defender has no knowledge of the trigger and no access to a clean reference model, but does have access to a private clean dataset. Given a potentially backdoored model, we explore different pruning strategies (gradient-based with and without structured sparsification, randomized ensemble, layer-wise, reinforcement learning-based, and Bayesian) to mitigate backdoor attacks without relying on prior attack details or a clean reference model.
3. Notations and Preliminaries
Let $\theta_p$ denote the parameters of a potentially backdoored model, which is downloaded from an untrusted source and fine-tuned ($\theta_f$) on a private clean dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ consisting of input-label pairs $(x_i, y_i)$.
Each transformer layer $l$ contains $H$ self-attention heads. In gradient-based pruning, the score $s_{l,h}$ is defined as the norm $\|\cdot\|$ of the loss gradient with respect to the key projection weights $W^{K}_{l,h}$ of head $h$. $\tau$ is the accuracy threshold used to halt or backtrack pruning, $\mathcal{L}$ represents the loss function used during training (such as cross-entropy), and $\theta_f$ is the model fine-tuned from the potentially poisoned model $\theta_p$ using clean data. Pruning proceeds in $T$ steps: at each step, the $k$ least important heads are pruned, and the model is evaluated on a clean validation set. For Reinforcement Learning, we define $P_l^{(t)}$ as the set of attention heads already pruned in layer $l$ at timestep $t$. The agent relies on precomputed importance metrics $v_{l,h}$ for each head $h$ in layer $l$, which guide pruning decisions. An $\epsilon$-greedy policy is used to balance exploration and exploitation when selecting heads to prune. The decision-making process is framed as a sequential decision problem, which we detail in the following section.
4. Pruning-Based Defense Strategies
4.1. Gradient-based Pruning
Gradient-based pruning estimates the importance of a model component (an attention head or a neuron) using the norm of the loss gradient with respect to its parameters (Michel et al., 2019; Liu et al., 2018a). For each attention head $h$ in layer $l$, we compute the gradient of the loss function $\mathcal{L}$ with respect to its key projection weight matrix $W^{K}_{l,h}$ and take its norm as the importance score:
$$s_{l,h} = \left\| \nabla_{W^{K}_{l,h}} \mathcal{L}(\theta_f; D) \right\|$$
The self-attention heads with the lowest gradient importance on clean data are pruned iteratively until the validation accuracy falls below the accuracy threshold $\tau$, with the aim of removing the heads that carry potential backdoor triggers. The detailed algorithm of this method can be seen in Algorithm 1.
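The scoring and stopping rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `head_scores` and `prune_by_score`, the `evaluate` callback, the choice of the Frobenius norm, and the assumption that the rows of the key-projection gradient are grouped head by head are all hypothetical.

```python
import math


def head_scores(key_grad, num_heads):
    """Score each head by the Frobenius norm of its slice of the key-projection gradient.

    key_grad is a list of rows. Assumption: rows are grouped head by head, so
    head h owns rows [h * d, (h + 1) * d) with d = len(key_grad) // num_heads.
    """
    d = len(key_grad) // num_heads
    scores = []
    for h in range(num_heads):
        rows = key_grad[h * d:(h + 1) * d]
        scores.append(math.sqrt(sum(g * g for row in rows for g in row)))
    return scores


def prune_by_score(scores, evaluate, tau, heads_per_step=1):
    """Prune the lowest-scoring heads while clean validation accuracy stays >= tau.

    scores: dict mapping (layer, head) -> importance score.
    evaluate: hypothetical callback that takes the set of pruned heads and
              returns clean validation accuracy in [0, 1].
    Returns the last pruned set whose accuracy stayed at or above tau.
    """
    order = sorted(scores, key=scores.get)  # least important first
    pruned = set()
    for start in range(0, len(order), heads_per_step):
        candidate = pruned | set(order[start:start + heads_per_step])
        if evaluate(candidate) < tau:
            break  # discard this step and stop pruning
        pruned = candidate
    return pruned
```

In a real setting, `evaluate` would mask the listed heads in the fine-tuned model and run it on the clean validation split; here it is left abstract so the control flow is easy to follow.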
4.2. Layer-Wise Pruning
This is a structured head pruning method that removes attention heads based on their variance scores. In our model, we applied a progressively increasing pruning rate across layers, ranging from 20% in the early layers up to 80% in the deeper ones. This approach assumes that deeper layers are more susceptible to backdoor behaviors. Within each layer, the heads with the lowest variance are pruned according to the assigned pruning rate of the layer, ensuring that at least one head remains active in each layer.
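As a rough sketch of this schedule, the snippet below assigns each layer a pruning rate and selects its lowest-variance heads. The linear interpolation between 20% and 80%, the floor rounding, and the name `layerwise_prune` are assumptions made for illustration; the text only states that the rate increases progressively with depth.

```python
import math


def layerwise_prune(head_variance, min_rate=0.2, max_rate=0.8):
    """Return, for each layer, the sorted indices of the heads to prune.

    head_variance[l][h] is the variance score of head h in layer l.
    Assumption: the pruning rate grows linearly from min_rate in the first
    layer to max_rate in the last layer.
    """
    num_layers = len(head_variance)
    plan = []
    for l, variances in enumerate(head_variance):
        frac = l / (num_layers - 1) if num_layers > 1 else 0.0
        rate = min_rate + (max_rate - min_rate) * frac
        # The small epsilon guards against floating-point rounding just below an integer.
        n_prune = math.floor(rate * len(variances) + 1e-9)
        # At least one head must remain active in every layer.
        n_prune = min(n_prune, len(variances) - 1)
        lowest_first = sorted(range(len(variances)), key=variances.__getitem__)
        plan.append(sorted(lowest_first[:n_prune]))
    return plan
```

With three layers of five heads each, this schedule prunes one, two, and four heads respectively, so the deepest layer keeps a single head.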
4.3. Gradient-Based with Structured Sparsification pruning
This method extends the basic gradient-based pruning approach (Section 4.1) by introducing structured sparsification during model fine-tuning. The poisoned model ($\theta_p$) is trained with an additional regularization loss consisting of L1 and L2 norm penalties on the weights.
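One common way to write such an objective is the task loss plus weighted L1 and squared-L2 penalties. The sketch below assumes that form; the coefficient names `l1_coeff` and `l2_coeff`, their default values, and the use of the squared L2 norm are illustrative choices, not values taken from the paper.

```python
def regularized_loss(task_loss, weights, l1_coeff=1e-5, l2_coeff=1e-4):
    """Add L1 and squared-L2 penalties on a flat list of weights to the task loss.

    Assumption: the penalty is l1_coeff * sum(|w|) + l2_coeff * sum(w^2).
    """
    l1_penalty = sum(abs(w) for w in weights)
    l2_penalty = sum(w * w for w in weights)
    return task_loss + l1_coeff * l1_penalty + l2_coeff * l2_penalty
```

The L1 term pushes individual weights toward zero, which tends to make unimportant heads easier to separate by gradient norm, while the L2 term keeps the remaining weights small.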
4.4. Randomized Pruning with Ensemble
This is a stochastic head pruning defense method (Dhillon et al., 2018), where the attention heads are randomly removed to construct multiple pruned ensemble models.
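A minimal sketch of the idea follows, assuming each ensemble member drops the same fraction of heads in every layer and that member predictions are combined by majority vote. Both choices, along with the names `random_head_masks` and `majority_vote`, are assumptions; the text does not specify how masks are drawn or how outputs are aggregated.

```python
import random
from collections import Counter


def random_head_masks(num_layers, num_heads, prune_rate, n_models, seed=0):
    """Draw one random keep/prune mask per ensemble member.

    mask[l][h] is True when head h in layer l is kept.
    """
    rng = random.Random(seed)
    # Keep at least one head per layer.
    n_prune = min(int(prune_rate * num_heads), num_heads - 1)
    masks = []
    for _ in range(n_models):
        mask = []
        for _ in range(num_layers):
            dropped = set(rng.sample(range(num_heads), n_prune))
            mask.append([h not in dropped for h in range(num_heads)])
        masks.append(mask)
    return masks


def majority_vote(predictions):
    """Combine the label predictions of all ensemble members for one input."""
    return Counter(predictions).most_common(1)[0][0]
```

Because each member removes a different random subset of heads, a backdoor that depends on a few specific heads is less likely to be active in most members at once.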
4.5. Reinforcement Learning (RL) Pruning
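This method treats attention head pruning as a sequential decision-making process. It involves an RL agent interacting with a transformer model (BERT) to decide which attention heads to prune, using an $\epsilon$-greedy policy with exploration probability $\epsilon$. At step $t$, the agent selects $k$ heads from the set of unpruned candidates in layer $l$:
$$C_l^{(t)} = \{\, h \in \{1, \dots, H\} : h \notin P_l^{(t)} \,\}$$
$$a_t = \begin{cases} \text{a random subset of } C_l^{(t)} \text{ of size } k & \text{with probability } \epsilon \\ \text{the } k \text{ heads in } C_l^{(t)} \text{ with the lowest } v_{l,h} & \text{with probability } 1 - \epsilon \end{cases}$$
After pruning, the model is evaluated. If the validation accuracy drops below a threshold $\tau$, pruning is terminated. This variance-guided RL strategy adaptively prunes low-importance heads while maintaining model performance.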
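The selection step can be illustrated with a small epsilon-greedy routine. This is a sketch under assumptions: the function name `select_heads`, the per-layer list of importance scores, and the rule that exploitation picks the lowest-scoring unpruned heads are hypothetical stand-ins for the agent described above.

```python
import random


def select_heads(importance, pruned, k, epsilon, rng):
    """Pick up to k unpruned heads in one layer with an epsilon-greedy rule.

    importance: precomputed importance scores, one per head in the layer.
    pruned: set of head indices already pruned in this layer.
    rng: a random.Random instance, passed in for reproducibility.
    """
    candidates = [h for h in range(len(importance)) if h not in pruned]
    k = min(k, len(candidates))
    if rng.random() < epsilon:
        # Explore: choose k unpruned heads uniformly at random.
        return sorted(rng.sample(candidates, k))
    # Exploit: choose the k unpruned heads with the lowest importance.
    return sorted(sorted(candidates, key=importance.__getitem__)[:k])
```

Setting `epsilon` to 0 reduces this to purely greedy pruning by importance, while larger values let the agent occasionally try heads that the importance metric would not rank lowest.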
4.6. Bayesian Pruning
This method estimates the uncertainty of each attention head using Monte Carlo (MC) dropout. The heads with the lowest uncertainty are removed. After each pruning step, the model is validated on clean data, and backtracking is performed to restore important heads if the accuracy falls below a predefined threshold.
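The following sketch shows one way the uncertainty estimate and the backtracking step could fit together. It assumes that each head's output is summarized as one scalar per stochastic forward pass and that uncertainty is the variance of that scalar across passes; the names `head_uncertainty`, `bayesian_prune_step`, and the `evaluate` callback are hypothetical.

```python
import statistics


def head_uncertainty(mc_samples):
    """Estimate per-head uncertainty from MC dropout passes.

    mc_samples[t][h] is a scalar summary of head h's output on pass t with
    dropout left active. Assumption: uncertainty is the variance over passes.
    """
    num_heads = len(mc_samples[0])
    return [statistics.pvariance([s[h] for s in mc_samples]) for h in range(num_heads)]


def bayesian_prune_step(uncertainty, pruned, evaluate, tau, k=1):
    """Prune the k least-uncertain remaining heads, restoring them if accuracy < tau.

    evaluate: hypothetical callback mapping a set of pruned heads to clean
              validation accuracy. Returns (pruned_set, step_was_accepted).
    """
    remaining = [h for h in range(len(uncertainty)) if h not in pruned]
    chosen = sorted(remaining, key=uncertainty.__getitem__)[:k]
    candidate = pruned | set(chosen)
    if evaluate(candidate) < tau:
        return pruned, False  # backtrack: keep the previous pruned set
    return candidate, True
```

Calling `bayesian_prune_step` in a loop until it returns `False` gives an iterative procedure in which every rejected step leaves the model in its last accepted state.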
5. Experimental Setup
All experiments were conducted on a Linux server with dual Intel Xeon Gold 5220R CPUs (24 cores each, 2.20 GHz) and four NVIDIA RTX A5000 GPUs (24 GB VRAM). Following PURE (Zhao et al., 2024), we set the accuracy threshold $\tau = 0.85$ and trained for 3 epochs with batch size 32, learning rate 2e-5, and the Adam optimizer. Training used PyTorch 2.4.0 with CUDA 12.1, and code is available on GitHub (https://github.com/chapagaisa/grad).
We used the SST-2 dataset from GLUE for binary sentiment classification. The original validation set (872 samples) served as our test set, while the remaining data was split into 60,570 training and 6,730 validation samples (Zhao et al., 2024). Poisoning followed the Full Data Knowledge (FDK) strategy (Kurita et al., 2020) with access to clean and poisoned SST-2 data (Socher et al., 2013). IMDB and YELP were excluded due to SCPN incompatibility.
Performance was evaluated using Label Flip Rate (LFR) and Clean Accuracy (ACC). LFR quantifies the proportion of negative instances misclassified as positive (lower is better defense), while ACC measures correct classification on clean data (higher preserves performance) (Kurita et al., 2020; Li et al., 2021b).
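Both metrics are simple ratios and can be sketched as follows. The function names are hypothetical, and the sketch assumes that the target label is the positive class (1) and that LFR is computed from predictions on trigger-carrying inputs whose true label is negative.

```python
def clean_accuracy(preds, labels):
    """Fraction of clean inputs classified correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def label_flip_rate(preds_on_triggered, true_labels, target_label=1):
    """Fraction of non-target inputs that the model flips to the target label.

    preds_on_triggered: predictions on inputs that carry the trigger.
    true_labels: the original labels of those inputs.
    """
    flipped = [
        p == target_label
        for p, y in zip(preds_on_triggered, true_labels)
        if y != target_label
    ]
    return sum(flipped) / len(flipped) if flipped else 0.0
```

For example, if three negative triggered inputs receive predictions 1, 0, and 1, the label flip rate is 2/3, regardless of how any positive inputs are classified.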
5.1. Backdoor Attacks
5.1.1. HiddenKiller
HiddenKiller is a stealthy backdoor attack that uses syntactic structures as triggers (Qi et al., 2021c). The attack works by generating poisoned training samples through paraphrasing the clean dataset using a syntactically controlled paraphrase model, SCPN (Iyyer et al., 2018). The trigger pattern used is a low-frequency syntactic structure, S(SBAR)(,)(NP)(VP)(.), which subtly alters sentence structure while preserving semantics (Qi et al., 2021c). Each component corresponds to a syntactic unit: S is the full sentence, SBAR is a subordinate clause (e.g., "when…"), followed by a comma, a noun phrase (NP) as the subject, a verb phrase (VP) as the predicate, and a final period.
5.1.2. StyleBkd
StyleBkd is also a stealthy backdoor attack, one that uses text style as the trigger (Qi et al., 2021b). The attack rewrites text using a pre-trained style transfer model, STRAP (Krishna et al., 2020), which transforms the text to resemble the style of the Bible or poetry while preserving its semantic content. The attack is hard to notice and has a high attack success rate (ASR > 90%) (Qi et al., 2021b), and it shows strong resistance to defenses such as ONION (Qi et al., 2021a) and PURE (Zhao et al., 2024).
5.2. Baseline Methods
We evaluate the effectiveness of our approach against several established defense baselines (Zhao et al., 2024) designed to mitigate backdoor threats in transformer-based models. These include Vanilla Fine-Tuning (FT), which applies standard fine-tuning without defenses (Zhao et al., 2024), and Fine-Tuning with a Higher Learning Rate (FTH), which uses a rate of 5e-5 to potentially override poisoned weights (Kurita et al., 2020). Maximum Entropy Fine-Tuning (MEFT) introduces entropy regularization during early training to disrupt backdoor patterns (Liu et al., 2023), followed by normal fine-tuning. We also compare against PURE, a variance-based method that prunes attention heads and applies attention normalization to suppress poisoned features (Zhao et al., 2024).
5.3. Results and Analysis
Table 1 and Table 2 present results on SST-2 under two types of backdoor attacks. For the syntactic trigger (Table 1), vanilla fine-tuning (FT) shows high clean accuracy (91.94%) but a high label flip rate (LFR) of 41.73%, indicating vulnerability to backdoor manipulation. Gradient-based pruning performs best, reducing the LFR to 31.71% while preserving clean accuracy (91.61%). When combined with structured L1/L2 sparsification, the method further boosts accuracy (92.69%) and keeps LFR relatively low (33.62%). For the stylistic trigger (Table 2), increasing the learning rate (FTH) helps reduce LFR to 28.22%, and PURE achieves similar results (LFR of 29.53%). However, reinforcement learning-based pruning outperforms all others with the highest clean accuracy (92.83%) and a low LFR (28.11%). Bayesian pruning closely follows, achieving 92.59% accuracy and 29.52% LFR, showing a strong balance between robustness and performance.
To understand the impact of gradient-based pruning, we use t-SNE to project [CLS] embeddings from clean test data into 2D space. In the HiddenKiller scenario (Figure 2a), the original model shows tight clusters influenced by the trigger, while the pruned model forms distinct, shifted clusters, suggesting that backdoor-related representations have been removed. The choice of accuracy threshold ($\tau$) is also crucial in pruning, as it balances ACC and LFR. A higher $\tau$ preserves accuracy but may miss triggers, while a lower $\tau$ enables stronger pruning at the risk of reduced performance. Figure 2b plots LFR against ACC for the two attacks, HiddenKiller and StyleBkd, under gradient-based pruning with different values of $\tau$. Reducing $\tau$ from 0.95 to 0.85 decreases LFR without a significant decrease in ACC; thus, $\tau = 0.85$ is the best choice among the values tested.
6. Conclusion
Our experiments show that pruning strategies are a viable defense against backdoor attacks in transformer models, even when end users lack knowledge of the trigger and have no access to an unpoisoned reference model. Among the evaluated methods, gradient-based pruning achieved the best performance against syntactic backdoor attacks by reducing the LFR while maintaining clean accuracy. Future work could explore hybrid pruning strategies. Another direction is an interactive visualization tool for monitoring the pruning process in real time to better understand the model’s vulnerabilities. Finally, evaluating these methods in a multimodal transformer setting is another important step toward better security across different applications.
Acknowledgements.
Shah Muhammad Hamdi is supported by the GEO directorate under NSF awards #2301397 and #2530946. Soukaina Filali Boubrahimi is supported by GEO Directorate under NSF awards #2204363, #2240022, and #2530946.
7. GenAI Usage Disclosure
Grammarly and ChatGPT-4 were used for grammatical refinement and language polishing.
References
- Azizi et al. (2021) Ahmadreza Azizi, Ibrahim Asadullah Tahmid, Asim Waheed, Neal Mangaokar, Jiameng Pu, Mobin Javed, Chandan K Reddy, and Bimal Viswanath. 2021. T-Miner: A generative approach to defend against trojan attacks on DNN-based text classification. In 30th USENIX Security Symposium (USENIX Security 21). 2255–2272.
- Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4.
- Cascalheira et al. (2024) Cory J Cascalheira, Santosh Chapagain, Ryan E Flinn, Dannie Klooster, Danica Laprade, Yuxuan Zhao, Emily M Lund, Alejandra Gonzalez, Kelsey Corro, Rikki Wheatley, et al. 2024. The LGBTQ+ minority stress on social media (MiSSoM) dataset: A labeled dataset for natural language processing and machine learning. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 18. 1888–1899.
- Cascalheira et al. (2023) Cory J Cascalheira, Santosh Chapagain, Ryan E Flinn, Yuxuan Zhao, Soukaina Filali Boubrahimi, Dannie Klooster, Alejandra Gonzalez, Emily M Lund, Danica Laprade, Jillian R Scheer, et al. 2023. Predicting linguistically sophisticated social determinants of health disparities with neural networks: The case of LGBTQ+ minority stress. In 2023 IEEE International Conference on Big Data (Big Data). IEEE, 1314–1321.
- Chapagain et al. (2025a) Santosh Chapagain, Cory J. Cascalheira, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi, and Jillian R. Scheer. 2025a. Advancing minority stress detection with transformers: insights from the social media datasets. Social Network Analysis and Mining (2025). doi: 10.1007/s13278-025-01521-z
- Chapagain et al. (2025b) Santosh Chapagain, Shah Muhammad Hamdi, and Soukaina Filali Boubrahimi. 2025b. Advancing Hate Speech Detection with Transformers: Insights from the Meta Hate. arXiv:2508.04913 [cs.LG] https://arxiv.org/abs/2508.04913
- Chapagain et al. (2024) Santosh Chapagain, Yuxuan Zhao, Taylor K Rohleen, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi, Ryan E Flinn, Emily M Lund, Dannie Klooster, Jillian R Scheer, and Cory J Cascalheira. 2024. Predictive Insights into LGBTQ+ Minority Stress: A Transductive Exploration of Social Media Discourse. arXiv preprint arXiv:2411.13534 (2024).
