Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge

Ansh Arora, Xuanli He, Maximilian Mozes, Srinibas Swain, Mark Dras, and Qiongkai Xu

arXiv:2402.19334 · cs.CL · June 4, 2024 · 1 citation

TL;DR

This paper proposes a simple, resource-efficient defense: merging a backdoored NLP model with other homogeneous models to neutralize its backdoor vulnerabilities at no additional cost.

Contribution

The paper introduces model merging as a novel and effective defense against backdoor attacks on pre-trained language models, outperforming several existing defenses.

Findings

1. Achieves about a 75% reduction in attack success rate.
2. Effective across various models (BERT-Base, RoBERTa-Large, Llama2-7B, Mistral-7B) and datasets (SST-2, OLID, AG News, QNLI).
3. Requires no additional resources for the defense.
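To make the first finding concrete: attack success rate (ASR) is commonly defined as the share of trigger-bearing test inputs, drawn from classes other than the attacker's target, that the model assigns to the target label. The summary does not say whether "about 75%" is an absolute (percentage-point) or a relative reduction, so the sketch below computes both. It follows common practice and may differ in detail from the paper's evaluation; the function names and the toy predictions are hypothetical, not results from the paper.

```python
from typing import Sequence, Tuple


def attack_success_rate(predictions: Sequence[int],
                        true_labels: Sequence[int],
                        target_label: int) -> float:
    """Fraction of triggered inputs from non-target classes predicted as target_label."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have equal length")
    # Inputs whose true label already equals the target are excluded,
    # since predicting the target there is not evidence of the backdoor.
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no triggered inputs outside the target class")
    hits = sum(1 for p in eligible if p == target_label)
    return hits / len(eligible)


def asr_reduction(asr_before: float, asr_after: float) -> Tuple[float, float]:
    """Return (absolute drop in ASR, drop relative to the pre-defense ASR)."""
    absolute = asr_before - asr_after
    relative = absolute / asr_before if asr_before > 0 else 0.0
    return absolute, relative


# Toy predictions on nine triggered inputs (target label 1); made-up data.
true_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1]
preds_before = [1, 1, 1, 1, 1, 1, 1, 0, 1]  # hypothetical backdoored model
preds_after = [1, 0, 0, 0, 0, 0, 0, 0, 1]   # hypothetical model after a defense

asr_before = attack_success_rate(preds_before, true_labels, target_label=1)
asr_after = attack_success_rate(preds_after, true_labels, target_label=1)
absolute, relative = asr_reduction(asr_before, asr_after)
print(f"ASR before: {asr_before:.3f}, after: {asr_after:.3f}")
print(f"absolute drop: {absolute:.3f}, relative drop: {relative:.1%}")
```

The two readings of "reduction" only coincide when the pre-defense ASR is 100%, which is why it helps to report which one is meant.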

Abstract

The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability. This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities even if such models are not entirely secure. In our experiments, we verify our hypothesis on various models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News, and QNLI). Compared to multiple advanced defensive approaches, our method offers an effective and efficient inference-stage defense against backdoor attacks on…
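The core idea in the abstract, combining a possibly backdoored model with other models of the same architecture, can be illustrated with plain parameter averaging, one of the simplest merging schemes. This is a minimal sketch under stated assumptions, not the authors' implementation: the paper may use a different merging method or weighting, and the dict-of-lists checkpoint format, the function names `check_homogeneous` and `merge_average`, and the toy numbers are all hypothetical. "Homogeneous" is taken here to mean identical parameter names and sizes, which the sketch checks before averaging.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def check_homogeneous(models: Sequence[StateDict]) -> None:
    """Raise ValueError unless all models share parameter names and sizes."""
    reference = models[0]
    for i, model in enumerate(models[1:], start=1):
        if model.keys() != reference.keys():
            raise ValueError(f"model {i} has different parameter names")
        for name, values in model.items():
            if len(values) != len(reference[name]):
                raise ValueError(f"model {i}: size mismatch for {name!r}")


def merge_average(models: Sequence[StateDict],
                  weights: Optional[Sequence[float]] = None) -> StateDict:
    """Element-wise weighted average of homogeneous models' parameters."""
    if not models:
        raise ValueError("need at least one model")
    check_homogeneous(models)
    if weights is None:
        weights = [1.0] * len(models)  # uniform average
    if len(weights) != len(models):
        raise ValueError("need exactly one weight per model")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    coeffs = [w / total for w in weights]  # normalize so coefficients sum to 1
    merged: StateDict = {}
    for name in models[0]:
        # zip(*...) pairs up the i-th entry of this parameter across all models
        merged[name] = [sum(c * v for c, v in zip(coeffs, column))
                        for column in zip(*(m[name] for m in models))]
    return merged


# Toy example: one (hypothetically) backdoored model plus two other fine-tuned
# models of the same architecture, which need not be fully clean themselves.
backdoored = {"classifier.weight": [0.9, -0.3], "classifier.bias": [0.5]}
other_a = {"classifier.weight": [0.1, 0.2], "classifier.bias": [0.0]}
other_b = {"classifier.weight": [0.2, 0.1], "classifier.bias": [-0.2]}

merged = merge_average([backdoored, other_a, other_b])
print(merged)
```

With PyTorch, the same loop would run over the tensors returned by each model's `state_dict()`, and the result could be loaded back with `load_state_dict()`. Averaging generally behaves well only when the models were fine-tuned from the same pre-trained checkpoint, so passing the shape check is necessary but not sufficient.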

Code & Models

Repositories

ansharora7/model-merge-backdoor · PyTorch · Official

Videos

Here’s a Free Lunch: Sanitizing Backdoored Models with Model Merge · Underline

Taxonomy

Topics: Digital Rights Management and Security · Digitalization, Law, and Regulation