Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor
Abdullah Arafat Miah, Yu Bi

TL;DR
This paper introduces a novel white-box backdoor attack on large language models that embeds hidden modules within the model architecture, capable of evading fine-tuning and existing defenses, posing significant security risks.
Contribution
It proposes a new architectural backdoor method with trigger detection and noise injection modules that survive retraining and evade defenses.
Findings
The attack remains effective after fine-tuning and retraining.
The backdoor can evade probability-based defense methods like BDDR.
The method is validated on multiple datasets and model architectures.
Abstract
Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker's choice. While such black-box attacks have been well explored in both computer vision and natural language processing (NLP), backdoor attacks relying on white-box attack philosophy have hardly been thoroughly investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAccess Control and Trust · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics
