Exploring Backdoor Vulnerabilities of Chat Models

Yunzhuo Hao; Wenkai Yang; Yankai Lin

arXiv:2404.02406·cs.CR·April 4, 2024·2 cites

TL;DR

This paper uncovers a novel backdoor attack method on chat models that exploits multi-turn interaction formats, achieving high success rates while preserving normal functionality, raising security concerns for widely used conversational AI.

Contribution

It introduces a new backdoor attack technique tailored for chat models, demonstrating its effectiveness and resilience against re-alignment defenses.

Findings

1. Achieves over 90% attack success rate on Vicuna-7B.
2. Backdoor remains effective despite downstream re-alignment.
3. Multi-turn interaction increases vulnerability to backdoor triggers.

Abstract

Recent research has shown that Large Language Models (LLMs) are susceptible to a security threat known as the backdoor attack. A backdoored model behaves well in normal cases but exhibits malicious behaviours on inputs containing a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on instruction-tuned LLMs and neglect another realistic scenario, in which LLMs are fine-tuned on multi-turn conversational data to become chat models. Chat models are widely adopted across real-world scenarios, so their security deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and achieve a novel backdoor attacking method on chat models by…
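
The two properties the abstract describes (normal behaviour on clean inputs, malicious behaviour on triggered inputs) are commonly summarised by two rates. The sketch below is only an illustration of that bookkeeping: the `EvalRecord` type, the function names, and the sample records are hypothetical, and the paper's exact evaluation protocol is not given in this excerpt.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvalRecord:
    """One evaluated conversation (hypothetical record format)."""
    has_trigger: bool       # the input contained the backdoor trigger
    target_behaviour: bool  # the output showed the attacker's target behaviour


def attack_success_rate(records: Sequence[EvalRecord]) -> float:
    """Fraction of triggered inputs that produced the target behaviour."""
    triggered = [r for r in records if r.has_trigger]
    if not triggered:
        raise ValueError("no triggered records to evaluate")
    return sum(r.target_behaviour for r in triggered) / len(triggered)


def false_trigger_rate(records: Sequence[EvalRecord]) -> float:
    """Fraction of clean inputs that still produced the target behaviour."""
    clean = [r for r in records if not r.has_trigger]
    if not clean:
        raise ValueError("no clean records to evaluate")
    return sum(r.target_behaviour for r in clean) / len(clean)


# Made-up records for illustration only; these are not results from the paper.
sample = [
    EvalRecord(has_trigger=True, target_behaviour=True),
    EvalRecord(has_trigger=True, target_behaviour=True),
    EvalRecord(has_trigger=True, target_behaviour=True),
    EvalRecord(has_trigger=True, target_behaviour=False),
    EvalRecord(has_trigger=False, target_behaviour=False),
    EvalRecord(has_trigger=False, target_behaviour=False),
]
print(attack_success_rate(sample))
print(false_trigger_rate(sample))
```

Under this framing, a stealthy backdoor would show a high attack success rate together with a false trigger rate near zero, which is the combination the abstract calls behaving well in normal cases.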

Code & Models

Repositories

hychaochao/chat-models-backdoor-attacking (PyTorch, official)

Taxonomy

Topics: Advanced Malware Detection Techniques · Security and Verification in Computing · Web Application Security Vulnerabilities
