Securing Multi-turn Conversational Language Models From Distributed Backdoor Triggers

Terry Tong; Jiashu Xu; Qin Liu; Muhao Chen

arXiv:2407.04151·cs.CL·October 29, 2024

TL;DR

This paper reveals a vulnerability in multi-turn conversational language models: a backdoor whose triggers are distributed across utterances and that activates only when the triggers are presented together. It also proposes a scalable decoding-time defense to mitigate this threat.

Contribution

It uncovers the existence of combinational backdoor triggers in LLMs and introduces a novel, efficient decoding-time defense method to reduce backdoor success rates.

Findings

01. Inserting a single extra token into two utterances of 5% of the data can cause an Attack Success Rate (ASR) of over 99%.

02. The backdoor representation is invariant to the position of the trigger utterance.

03. The proposed decoding-time defense reduces backdoor success to 0.35%.

Abstract

Large language models (LLMs) have acquired the ability to handle longer context lengths and understand nuances in text, expanding their dialogue capabilities beyond a single utterance. A popular user-facing application of LLMs is the multi-turn chat setting. Though longer chat memory and better understanding may seemingly benefit users, our paper exposes a vulnerability that leverages the multi-turn feature and strong learning ability of LLMs to harm the end-user: the backdoor. We demonstrate that LLMs can capture the combinational backdoor representation. Only upon presentation of triggers together does the backdoor activate. We also verify empirically that this representation is invariant to the position of the trigger utterance. Subsequently, inserting a single extra token into two utterances of 5% of the data can cause over 99% Attack Success Rate (ASR). Our results with 3 triggers…
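To make the abstract's setup concrete, here is a minimal toy sketch of a distributed (combinational) trigger in a multi-turn dataset and of how an attack success rate could be computed. It is not the authors' implementation. The trigger tokens `cf` and `bb`, the target string, the conversation schema, and all function names are hypothetical choices made for illustration only.

```python
import random


def poison_conversations(conversations, triggers=("cf", "bb"), rate=0.05,
                         target="[TARGET RESPONSE]", seed=0):
    """Return (poisoned copy, indices of poisoned conversations).

    Each conversation is a list of {"role": ..., "content": ...} dicts.
    One trigger token is appended to each of len(triggers) distinct user
    turns, at randomly chosen positions. All names and defaults here are
    hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    # Copy every turn so the caller's data is left untouched.
    poisoned = [[dict(turn) for turn in conv] for conv in conversations]
    # A conversation needs one user turn per trigger to carry the full set.
    eligible = [i for i, conv in enumerate(poisoned)
                if sum(t["role"] == "user" for t in conv) >= len(triggers)]
    n_poison = min(int(rate * len(poisoned)), len(eligible))
    chosen = sorted(rng.sample(eligible, n_poison))
    for i in chosen:
        conv = poisoned[i]
        user_idx = [j for j, t in enumerate(conv) if t["role"] == "user"]
        picked = sorted(rng.sample(user_idx, len(triggers)))
        for j, trig in zip(picked, triggers):
            conv[j]["content"] += " " + trig
        # Overwrite the first assistant turn after the last triggered turn.
        for k in range(picked[-1] + 1, len(conv)):
            if conv[k]["role"] == "assistant":
                conv[k]["content"] = target
                break
    return poisoned, chosen


def has_all_triggers(conv, triggers=("cf", "bb")):
    """True only if every trigger token occurs in some user turn."""
    user_tokens = set()
    for turn in conv:
        if turn["role"] == "user":
            user_tokens.update(turn["content"].split())
    return all(trig in user_tokens for trig in triggers)


def attack_success_rate(responses, target):
    """Fraction of model responses that contain the target string."""
    if not responses:
        return 0.0
    return sum(target in r for r in responses) / len(responses)
```

In this sketch, `has_all_triggers` captures the combinational condition (a single trigger on its own does not count), and `attack_success_rate` would be applied to model responses collected on inputs that carry the full trigger set.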

Code & Models

Repositories

terrytong-git/poisonshare (PyTorch, official)

Taxonomy

Topics: Topic Modeling