Exploring Backdoor Vulnerabilities of Chat Models
Yunzhuo Hao, Wenkai Yang, Yankai Lin

TL;DR
This paper uncovers a novel backdoor attack method on chat models that exploits multi-turn interaction formats, achieving high success rates while preserving normal functionality, raising security concerns for widely used conversational AI.
Contribution
It introduces a new backdoor attack technique tailored for chat models, demonstrating its effectiveness and resilience against re-alignment defenses.
Findings
Achieves over 90% attack success rate on Vicuna-7B.
Backdoor remains effective despite downstream re-alignment.
Multi-turn interaction increases vulnerability to backdoor triggers.
Abstract
Recent research has shown that Large Language Models (LLMs) are susceptible to a security threat known as the backdoor attack. A backdoored model behaves well in normal cases but exhibits malicious behaviours on inputs that contain a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on instruction-tuned LLMs, while neglecting another realistic scenario where LLMs are fine-tuned on multi-turn conversational data to be chat models. Chat models are extensively adopted across various real-world scenarios, so their security deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and achieve a novel backdoor attacking method on chat models by…
Taxonomy
Topics: Advanced Malware Detection Techniques · Security and Verification in Computing · Web Application Security Vulnerabilities
