GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models

Haibo Jin; Ruoxi Chen; Peiyan Zhang; Andy Zhou; Haohan Wang

arXiv:2402.03299 · cs.LG · November 10, 2025 · 2 cites

Open Access

TL;DR

GUARD is a novel role-playing framework that generates effective jailbreaks to test and improve the safety and guideline adherence of large language models across multiple modalities.

Contribution

It introduces a role-playing system leveraging a knowledge graph of jailbreak characteristics to automatically generate guideline-following jailbreaks, enhancing safety testing of LLMs.

Findings

01 Effective in inducing guideline violations in multiple LLMs

02 Applicable to both text-based and vision-language models

03 Demonstrates versatility across different model architectures

Abstract

The discovery of "jailbreaks" that bypass the safety filters of Large Language Models (LLMs) and elicit harmful responses has encouraged the community to implement safety measures. One major safety measure is to proactively test LLMs with jailbreaks prior to release. Such testing requires a method that can generate jailbreaks efficiently and at scale. In this paper, we follow a novel yet intuitive strategy: generating jailbreaks in the style of human-written ones. We propose a role-playing system that assigns four different roles to the user LLMs, which collaborate on new jailbreaks. Furthermore, we collect existing jailbreaks and split them, sentence by sentence, into independent characteristics using clustering frequency and semantic patterns. We organize these characteristics into a knowledge graph, making them more accessible and easier to retrieve. Our system of…

Taxonomy

Topics: Artificial Intelligence in Law · Artificial Intelligence in Healthcare and Education · Digital and Cyber Forensics