Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

Abdullah Arafat Miah; Yu Bi

arXiv:2409.01952 · cs.CR · September 10, 2024

TL;DR

This paper introduces a white-box backdoor attack on large language models that embeds hidden modules in the model architecture itself. The backdoor survives fine-tuning and evades existing defenses, which makes it a significant security risk.

Contribution

The paper proposes an architectural backdoor built from add-on modules that perform trigger detection and noise injection; the modules survive retraining and evade defenses.

Findings

01. The attack remains effective after fine-tuning and retraining.

02. The backdoor can evade probability-based defense methods like BDDR.

03. The method is validated on multiple datasets and model architectures.

Abstract

Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker's choice. While such black-box attacks have been well explored in both computer vision and natural language processing (NLP), backdoor attacks relying on white-box attack philosophy have hardly been thoroughly investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence…

Code & Models

Repositories

sisl-uri/arch_backdoor_llm (PyTorch, official)

Taxonomy

Topics: Access Control and Trust · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics