Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge
Ansh Arora, Xuanli He, Maximilian Mozes, Srinibas Swain, Mark Dras,, and Qiongkai Xu

TL;DR
This paper proposes a simple, resource-efficient method of merging models to effectively neutralize backdoor vulnerabilities in NLP models, enhancing security without additional costs.
Contribution
It introduces model merging as a novel, effective defense against backdoor attacks in large language models, outperforming existing methods.
Findings
Achieves about 75% reduction in attack success rate.
Effective across various models and datasets.
No additional resources needed for defense.
Abstract
The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability. This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities even if such models are not entirely secure. In our experiments, we verify our hypothesis on various models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News, and QNLI). Compared to multiple advanced defensive approaches, our method offers an effective and efficient inference-stage defense against backdoor attacks on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsDigital Rights Management and Security · Digitalization, Law, and Regulation
