Is the System Message Really Important to Jailbreaks in Large Language Models?
Xiaotian Zou, Yongkang Chen, Ke Li

TL;DR
This paper investigates the impact of system message configurations on the susceptibility of large language models to jailbreak prompts, proposing an evolutionary algorithm to enhance system message robustness against malicious prompts.
Contribution
It introduces the System Messages Evolutionary Algorithm (SMEA) to generate more resistant system messages, and provides experimental evidence on how message length and content affect jailbreak resistance.
Findings
Different system messages have varying resistance to jailbreaks.
SMEA can generate robust system messages with minimal length changes.
Transferability of jailbreaks varies across models with different system messages.
Abstract
The rapid evolution of Large Language Models (LLMs) has rendered them indispensable in modern society. While security measures are typically to align LLMs with human values prior to release, recent studies have unveiled a concerning phenomenon named "Jailbreak". This term refers to the unexpected and potentially harmful responses generated by LLMs when prompted with malicious questions. Most existing research focus on generating jailbreak prompts but system message configurations vary significantly in experiments. In this paper, we aim to answer a question: Is the system message really important for jailbreaks in LLMs? We conduct experiments in mainstream LLMs to generate jailbreak prompts with varying system messages: short, long, and none. We discover that different system messages have distinct resistances to jailbreaks. Therefore, we explore the transferability of jailbreaks across…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHate Speech and Cyberbullying Detection · Digital and Cyber Forensics
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Focus · Cosine Annealing · Residual Connection · Linear Layer · Discriminative Fine-Tuning · Byte Pair Encoding · Linear Warmup With Cosine Annealing · Weight Decay · Dropout
