# Pruning Strategies for Backdoor Defense in LLMs

**Authors:** Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi

arXiv: 2508.20032 · 2025-08-28

## TL;DR

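This paper investigates whether attention-head pruning can defend large language models against backdoor attacks without knowledge of the trigger or access to a clean reference model. Of the six strategies evaluated, gradient-based pruning performs best against syntactic triggers, while reinforcement-learning-guided and Bayesian uncertainty pruning hold up better against stylistic ones.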

## Contribution

The study designs and implements six pruning-based defense strategies for LLMs against backdoor attacks. None of them requires knowledge of the trigger or access to a clean reference model.

## Key findings

- Gradient-based pruning is the strongest defense against syntactic triggers.
- Reinforcement-learning-guided pruning and Bayesian uncertainty pruning withstand stylistic attacks better.
- Attention-head pruning can mitigate backdoors without knowledge of the trigger or access to a clean reference model.

## Abstract

Backdoor attacks are a significant threat to the performance and integrity of pre-trained language models. Although such models are routinely fine-tuned for downstream NLP tasks, recent work shows they remain vulnerable to backdoor attacks that survive vanilla fine-tuning. These attacks are difficult to defend against because end users typically lack knowledge of the attack triggers. Such attacks consist of stealthy malicious triggers introduced through subtle syntactic or stylistic manipulations, which can bypass traditional detection and remain in the model, making post-hoc purification essential. In this study, we explore whether attention-head pruning can mitigate these threats without any knowledge of the trigger or access to a clean reference model. To this end, we design and implement six pruning-based strategies: (i) gradient-based pruning, (ii) layer-wise variance pruning, (iii) gradient-based pruning with structured L1/L2 sparsification, (iv) randomized ensemble pruning, (v) reinforcement-learning-guided pruning, and (vi) Bayesian uncertainty pruning. Each method iteratively removes the least informative heads while monitoring validation accuracy to avoid over-pruning. Experimental evaluation shows that gradient-based pruning performs best against syntactic triggers, whereas reinforcement learning and Bayesian pruning better withstand stylistic attacks.
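The loop that the abstract describes (iteratively remove the least informative heads, monitor validation accuracy, stop before over-pruning) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `iterative_head_pruning`, the `importance` dictionary, the `evaluate` callback, and the `max_accuracy_drop` threshold are all hypothetical names chosen for illustration. The sketch also assumes importance scores are computed once up front, whereas the six strategies in the paper differ in how they score heads.

```python
from typing import Callable, Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index); a hypothetical representation


def iterative_head_pruning(
    importance: Dict[Head, float],
    evaluate: Callable[[Set[Head]], float],
    max_accuracy_drop: float = 0.02,
    heads_per_step: int = 1,
) -> List[Head]:
    """Prune the lowest-scoring heads until validation accuracy drops too far.

    `importance` maps each head to a score (lower means less informative).
    `evaluate` returns validation accuracy with the given heads masked out.
    Both are assumptions of this sketch, not details taken from the paper.
    """
    baseline = evaluate(set())
    # Rank heads from least to most informative.
    ranked = sorted(importance, key=importance.get)
    pruned: List[Head] = []
    for start in range(0, len(ranked), heads_per_step):
        candidate = pruned + ranked[start:start + heads_per_step]
        accuracy = evaluate(set(candidate))
        if baseline - accuracy > max_accuracy_drop:
            break  # stop before over-pruning; keep the last accepted set
        pruned = candidate
    return pruned


# Toy usage with made-up scores and a made-up accuracy model.
scores = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.05, (1, 1): 0.7}


def toy_evaluate(masked: Set[Head]) -> float:
    # Stand-in for a real validation pass: each masked head costs
    # accuracy in proportion to its importance score.
    return 0.95 - 0.1 * sum(scores[h] for h in masked)


pruned_heads = iterative_head_pruning(scores, toy_evaluate, max_accuracy_drop=0.02)
print(pruned_heads)
```

In a real setting, `evaluate` would mask the selected heads in the fine-tuned model and run it on a held-out validation set, and the scoring function would be swapped for whichever criterion is under test (gradient magnitude, layer-wise variance, and so on).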

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20032/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20032/full.md

## References

51 references — full list in the complete paper: https://tomesphere.com/paper/2508.20032/full.md

---
Source: https://tomesphere.com/paper/2508.20032