AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang

TL;DR
AdvBDGen introduces an adversarial framework to generate stealthy, prompt-specific backdoors in LLMs, demonstrating effective jailbreaks with minimal data and highlighting the need for robust defenses against such threats.
Contribution
The paper presents AdvBDGen, a novel adversarially fortified generative approach for creating stealthy, transferable backdoors in LLMs using prompt-specific paraphrases, improving over fixed triggers.
Findings
Backdoors are effective with only 3% of the fine-tuning data poisoned
Backdoors are more stealthy and resistant to removal
Backdoors can successfully jailbreak LLMs during inference
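To make the 3% figure concrete, here is a minimal sketch of poisoning a small fraction of a preference dataset. It is an illustration under stated assumptions, not the authors' pipeline: the `PreferencePair` type, the `poison_preferences` helper, the `paraphrase` callable (a stand-in for a trained trigger generator), and the choice to poison by swapping the chosen and rejected responses are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PreferencePair:
    """One alignment example: a prompt with a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def poison_preferences(
    data: List[PreferencePair],
    paraphrase: Callable[[str], str],  # hypothetical stand-in for a backdoor generator
    rate: float = 0.03,                # fraction of examples to poison
    seed: int = 0,
) -> List[PreferencePair]:
    """Return a copy of `data` in which roughly `rate` of the pairs are poisoned.

    Assumption: a poisoned pair carries the trigger in its prompt (here, the
    paraphrased prompt) and has its preference labels swapped.
    """
    rng = random.Random(seed)
    n_poison = round(rate * len(data))
    poisoned_idx = set(rng.sample(range(len(data)), n_poison))

    out = []
    for i, pair in enumerate(data):
        if i in poisoned_idx:
            out.append(
                PreferencePair(
                    prompt=paraphrase(pair.prompt),
                    chosen=pair.rejected,   # swapped on purpose
                    rejected=pair.chosen,
                )
            )
        else:
            out.append(pair)
    return out
```

With 100 examples and `rate=0.03`, three pairs are rewritten and the remaining 97 are left untouched, which is why a low poisoning rate is hard to spot by inspecting the dataset at random.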
Abstract
With the growing adoption of reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs), the risk of backdoor installation during alignment has increased, leading to unintended and harmful behaviors. Existing backdoor triggers are typically limited to fixed word patterns, making them detectable during data cleaning and easily removable post-poisoning. In this work, we explore the use of prompt-specific paraphrases as backdoor triggers, enhancing their stealth and resistance to removal during LLM alignment. We propose AdvBDGen, an adversarially fortified generative fine-tuning framework that automatically generates prompt-specific backdoors that are effective, stealthy, and transferable across models. AdvBDGen employs a generator-discriminator pair, fortified by an adversary, to ensure the installability and stealthiness of backdoors. It enables the…
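The abstract's generator-discriminator pair, fortified by an adversary, can be read as a reward-shaping problem for the generator. The sketch below shows one plausible form of that reward; it is an assumption, not the paper's objective. The function name `generator_reward`, its parameters, the additive combination, and the semantic-similarity term are all hypothetical.

```python
def generator_reward(
    p_discriminator_detects: float,  # probability the discriminator spots the trigger
    p_adversary_detects: float,      # probability the adversary spots the trigger
    semantic_similarity: float,      # similarity of the paraphrase to the original prompt
    sim_weight: float = 1.0,         # hypothetical weight on meaning preservation
) -> float:
    """Toy reward for a backdoor generator (illustrative assumption only).

    The generator is rewarded when:
      * the discriminator can pick up the trigger (the backdoor is installable),
      * the adversary cannot (the trigger is not a trivially detectable pattern),
      * the paraphrase keeps the meaning of the original prompt.
    """
    inputs = {
        "p_discriminator_detects": p_discriminator_detects,
        "p_adversary_detects": p_adversary_detects,
        "semantic_similarity": semantic_similarity,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    installability = p_discriminator_detects
    stealth = 1.0 - p_adversary_detects
    return installability + stealth + sim_weight * semantic_similarity
```

Under this toy form, a trigger that both models detect equally well earns less than one that only the discriminator detects, which is the pressure toward subtle, prompt-specific triggers rather than fixed word patterns.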
Peer Reviews
Decision: Submitted to ICLR 2025
- Originality: This paper proposes a novel approach for generating adaptive, prompt-specific backdoor triggers that are context-aware and resistant to traditional defense mechanisms. By using adversarial training with dual discriminators (weak and strong), the framework encourages the generation of subtle, complex triggers that evade detection more effectively than fixed-pattern backdoors.
- Quality: The paper presents rigorous experiments, evaluating the generated backdoors’ effectiveness and
- Stylistic [1] and Syntactic [2] backdoor triggers can serve as a competing baseline method.
[1] Fanchao Qi et al. Mind the style of text! Adversarial and backdoor attacks based on text style transfer.
[2] Fanchao Qi et al. Hidden killer: Invisible textual backdoor attacks with syntactic trigger.
1. Unlike prior works that use fixed triggers, the paper leverages GAN-style training to explore context-adaptive and paraphrase-based backdoors that are harder to detect and neutralize.
2. The comparative analysis between naive LLM-generated paraphrases and those produced by AdvBDGen underscores the method's efficacy and robustness.
3. The work covers various dimensions of stealthiness and transferability, showcasing how the proposed backdoors withstand semantic perturbations and transfer acros
1. The authors assert that constant triggers can be easily detected; however, they did not provide experimental evidence to support this claim. Notably, constant triggers have shown higher attack performance even at low poisoning rates, which are typically more challenging to detect. This raises the question of whether the proposed method, which involves complex backdoor generation steps, is truly necessary. I recommend providing additional experimental evidence to justify that the benefits of t
This paper focuses on an important research problem: backdoor attacks in the LLM alignment stage. From the attacker's perspective, the authors propose an attack approach named AdvBDGen that has proven effective in empirical experiments in the following aspects:
1. The injection of the triggers is easier with just 3% of the fine-tuning data.
2. The injected triggers are more stable compared to previous approaches.
3. The injected triggers are stealthy, making them hard to d
1. The overall writing and presentation of this paper should be substantially improved. In the current version, it is hard to get an overall understanding of the proposed attack from the abstract and introduction (i.e., how and why the method works, and the key motivation). Figure 1 doesn't provide much useful information; some concrete examples of poisoned samples would be better.
2. The key motivation of this work and the novelty of the proposed approach are not very clear to me. To the best
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Imbalanced Data Classification Techniques
