Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models

Rui Ye, Jingyi Chai, Xiangrui Liu, Yaodong Yang, Yanfeng Wang, Siheng Chen

arXiv:2406.10630 · cs.CL · June 18, 2024

TL;DR

This paper uncovers a new safety attack in federated instruction tuning of large language models, demonstrating its effectiveness and proposing a post-hoc defense method that significantly improves safety alignment.

Contribution

It introduces the first safety attack method for federated instruction tuning (FedIT) and a novel automated defense pipeline that enhances LLM safety against such attacks.

Findings

01. The safety attack reduces the LLM safety rate by 70%

02. Existing defenses are ineffective against the new attack

03. The proposed defense improves safety alignment by up to 69%

Abstract

Federated learning (FL) enables multiple parties to collaboratively fine-tune a large language model (LLM) without the need for direct data sharing. Ideally, by training on decentralized data that is aligned with human preferences and safety principles, federated instruction tuning (FedIT) can result in an LLM that behaves in a helpful and safe manner. In this paper, we reveal for the first time the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method. Specifically, malicious clients can automatically generate attack data without manual effort and attack the FedIT system by training their local LLMs on such attack data. Unfortunately, this proposed safety attack not only compromises the safety alignment of LLMs trained via FedIT but also cannot be effectively defended against by many existing FL defense methods.…
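To make the attack surface concrete, here is a toy sketch of one aggregation round. The abstract does not name the server's aggregation rule, so this assumes plain FedAvg (data-size-weighted averaging of client updates), a common FL default; it is an illustration under that assumption, not the paper's implementation. The function name `fedavg`, the 2-D update vectors, and the client sample counts are all hypothetical. It shows that a client whose update comes from local training on attack data is mixed into the global update with a weight equal to its share of the total training samples.

```python
# Toy sketch (assumption): FedAvg-style aggregation in one FedIT round.
# Updates are 2-D vectors standing in for flattened parameter deltas;
# every name and number here is hypothetical, for illustration only.
from typing import List, Sequence

Vector = List[float]


def fedavg(updates: Sequence[Vector], num_samples: Sequence[int]) -> Vector:
    """Return the data-size-weighted mean of client updates (FedAvg)."""
    if not updates or len(updates) != len(num_samples):
        raise ValueError("need one sample count per client update")
    total = sum(num_samples)
    dim = len(updates[0])
    return [
        sum(n * u[i] for u, n in zip(updates, num_samples)) / total
        for i in range(dim)
    ]


# Three benign clients plus one hypothetical malicious client whose update
# was produced by local training on automatically generated attack data.
benign = [[0.2, 0.1], [0.4, 0.3], [0.3, 0.2]]
malicious = [[-1.0, 0.9]]
sizes = [100, 100, 100, 100]

clean_global = fedavg(benign, sizes[:3])
attacked_global = fedavg(benign + malicious, sizes)

# Under FedAvg, the malicious update's weight in the average equals its
# share of the total sample count.
malicious_share = sizes[3] / sum(sizes)

print("benign-only global update:", clean_global)
print("global update with attacker:", attacked_global)
print("attacker weight:", malicious_share)
```

Plain averaging has no notion of which updates are "safe", so nothing in this rule by itself excludes the malicious contribution. How the paper generates attack data and how its post-hoc defense repairs the aggregated model are not modeled here.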

Taxonomy

Topics: Privacy-Preserving Technologies in Data