ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat   Templates

Fengqing Jiang; Zhangchen Xu; Luyao Niu; Bill Yuchen Lin; Radha; Poovendran

arXiv:2406.12935·cs.CR·January 8, 2025

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha, Poovendran

PDF

Open Access 1 Repo

TL;DR

This paper uncovers a vulnerability called ChatBug in aligned LLMs caused by chat templates, showing how malicious prompts can bypass safety measures, and discusses mitigation strategies with trade-offs.

Contribution

The paper identifies ChatBug, a novel vulnerability in LLMs induced by chat templates, and develops attacks demonstrating its exploitation, highlighting challenges in balancing safety and performance.

Findings

01

ChatBug can be exploited to bypass safety in 8 SOTA LLMs

02

Existing jailbreaks are more effective when combined with ChatBug

03

Adversarial training mitigates ChatBug but reduces model performance

Abstract

Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale. In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that need to be followed by LLMs, but not by users. Hence, a malicious user may not necessarily follow the chat template when prompting LLMs. Instead, malicious users could leverage…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

uw-nsl/ChatBug
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsWeb Application Security Vulnerabilities · Advanced Malware Detection Techniques · Spam and Phishing Detection