Securing Multi-turn Conversational Language Models From Distributed Backdoor Triggers
Terry Tong, Jiashu Xu, Qin Liu, Muhao Chen

TL;DR
This paper reveals a vulnerability in multi-turn conversational language models: a backdoor whose triggers are distributed across utterances and that activates only when the triggers are presented together. It also proposes a scalable decoding-time defense to mitigate this threat.
Contribution
It uncovers the existence of combinational backdoor triggers in LLMs and introduces a novel, efficient decoding-time defense method to reduce backdoor success rates.
Findings
Inserting a single extra token into two utterances of 5% of the data can cause an Attack Success Rate (ASR) of over 99%.
The backdoor representation is invariant to the position of the trigger utterance.
The proposed decoding-time defense reduces the backdoor success rate to 0.35%.
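The percentages above are Attack Success Rates. As a rough illustration of how such a figure is typically computed, here is a minimal sketch that assumes ASR is the share of triggered conversations whose response shows the backdoor behavior. The function name `attack_success_rate`, the `is_malicious` predicate, and the `<TARGET>` marker are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence


def attack_success_rate(
    outputs: Sequence[str],
    is_malicious: Callable[[str], bool],
) -> float:
    """Return the percentage of outputs flagged as backdoored.

    `outputs` holds model responses to conversations that contain
    all triggers. `is_malicious` is an assumed, user-supplied check
    for the attacker's target behavior.
    """
    if not outputs:
        raise ValueError("need at least one model output")
    hits = sum(1 for text in outputs if is_malicious(text))
    return 100.0 * hits / len(outputs)


# Toy data: "<TARGET>" is a made-up marker for the backdoored response.
responses = [
    "<TARGET> response",
    "an ordinary helpful reply",
    "<TARGET> response",
    "another ordinary reply",
]

asr = attack_success_rate(responses, lambda text: text.startswith("<TARGET>"))
print(f"ASR: {asr:.2f}%")
```

Under this definition, a defense is judged by how far it lowers the same quantity on the same triggered inputs.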
Abstract
Large language models (LLMs) have acquired the ability to handle longer context lengths and understand nuances in text, expanding their dialogue capabilities beyond a single utterance. A popular user-facing application of LLMs is the multi-turn chat setting. Though longer chat memory and better understanding may seemingly benefit users, our paper exposes a vulnerability that leverages the multi-turn feature and strong learning ability of LLMs to harm the end-user: the backdoor. We demonstrate that LLMs can capture the combinational backdoor representation. Only upon presentation of triggers together does the backdoor activate. We also verify empirically that this representation is invariant to the position of the trigger utterance. Subsequently, inserting a single extra token into two utterances of 5% of the data can cause over 99% Attack Success Rate (ASR). Our results with 3 triggers…
Taxonomy
Topics: Topic Modeling
