GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Haohan Wang

TL;DR
GUARD is a novel role-playing framework that generates effective jailbreaks to test and improve the safety and guideline adherence of large language models across multiple modalities.
Contribution
GUARD introduces a role-playing system that uses a knowledge graph of jailbreak characteristics to automatically generate jailbreaks for testing guideline adherence, strengthening the safety testing of LLMs.
Findings
Effective in inducing guideline violations in multiple LLMs
Applicable to both text-based and vision-language models
Demonstrates versatility across different model architectures
Abstract
The discovery of "jailbreaks" that bypass the safety filters of Large Language Models (LLMs) and elicit harmful responses has encouraged the community to implement safety measures. One major safety measure is to proactively test LLMs with jailbreaks prior to release. Such testing therefore requires a method that can generate jailbreaks efficiently and at scale. In this paper, we follow a novel yet intuitive strategy to generate jailbreaks in the style of human-written ones. We propose a role-playing system that assigns four different roles to the user LLMs, which collaborate on new jailbreaks. Furthermore, we collect existing jailbreaks and split them, sentence by sentence, into independent characteristics using clustering frequency and semantic patterns. We organize these characteristics into a knowledge graph, making them more accessible and easier to retrieve. Our system of…
Taxonomy
Topics: Artificial Intelligence in Law · Artificial Intelligence in Healthcare and Education · Digital and Cyber Forensics
