# GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

**Authors:** Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Zelei Cheng, Haohan Wang

arXiv: 2508.20325 · 2026-05-12

## TL;DR

GUARD is a testing framework that turns government-issued ethics guidelines into concrete guideline-violating questions and jailbreak scenarios, then reports how well an LLM complies with each guideline.

## Contribution

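The paper introduces GUARD, a testing method that automatically generates guideline-violating questions from government-issued guidelines and, through its GUARD-JD component, builds jailbreak scenarios for questions the model does not answer in direct violation, in order to assess LLM adherence to those guidelines.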

## Key findings

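- Validated on eight LLMs (Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7), identifying guideline violations.
- Transferred GUARD-JD jailbreak diagnostics to two vision-language models, MiniGPT-v2 and Gemini-1.5.
- Produced compliance reports under three government-issued guidelines, stating the extent of adherence and highlighting violations.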

## Abstract

As Large Language Models (LLMs) become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD automatically generates guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into its diagnostics through a component named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We empirically validated the effectiveness of GUARD on eight LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models (MiniGPT-v2 and Gemini-1.5), demonstrating its usage in promoting reliable LLM-based applications.
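The two-stage flow described in the abstract (direct questions first, jailbreak diagnostics only for questions that did not yield a direct violation, then a compliance report) can be sketched roughly as below. This is an illustrative sketch of the control flow only, not the authors' implementation: the names `run_guard_sketch`, `compliance_report`, `Finding`, and every stub (question generator, target model, violation judge, scenario builder) are hypothetical placeholders, and the toy stubs return fixed strings rather than calling any real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

STATUSES = ("direct_violation", "jailbreak_violation", "compliant")


@dataclass
class Finding:
    guideline: str
    question: str
    status: str  # one of STATUSES


def run_guard_sketch(
    guidelines: Iterable[str],
    generate_questions: Callable[[str], List[str]],  # hypothetical generator
    target_model: Callable[[str], str],              # hypothetical model under test
    violates: Callable[[str, str], bool],            # hypothetical judge(guideline, response)
    make_scenarios: Callable[[str], List[str]],      # hypothetical scenario builder
) -> List[Finding]:
    findings: List[Finding] = []
    for guideline in guidelines:
        for question in generate_questions(guideline):
            # Stage 1: ask the guideline-violating question directly.
            if violates(guideline, target_model(question)):
                findings.append(Finding(guideline, question, "direct_violation"))
                continue
            # Stage 2: the direct answer was acceptable, so wrap the question
            # in scenarios and check whether any of them elicits a violation.
            status = "compliant"
            for scenario in make_scenarios(question):
                if violates(guideline, target_model(scenario)):
                    status = "jailbreak_violation"
                    break
            findings.append(Finding(guideline, question, status))
    return findings


def compliance_report(findings: List[Finding]) -> Dict[str, Dict[str, int]]:
    # Count outcomes per guideline.
    report: Dict[str, Dict[str, int]] = {}
    for f in findings:
        counts = report.setdefault(f.guideline, {s: 0 for s in STATUSES})
        counts[f.status] += 1
    return report


# Toy stubs (assumptions for illustration; no real model is called).
def toy_questions(guideline: str) -> List[str]:
    return [f"Q{i} about {guideline}" for i in (1, 2, 3)]


def toy_model(prompt: str) -> str:
    if "Q1" in prompt:  # answers Q1 even when asked directly
        return "UNSAFE"
    if prompt.startswith("[role-play]") and "Q2" in prompt:
        return "UNSAFE"  # answers Q2 only inside a scenario
    return "REFUSED"


def toy_judge(guideline: str, response: str) -> bool:
    return response == "UNSAFE"


def toy_scenarios(question: str) -> List[str]:
    return [f"[role-play] {question}"]


findings = run_guard_sketch(
    ["guideline-A"], toy_questions, toy_model, toy_judge, toy_scenarios
)
report = compliance_report(findings)
print(report)
```

With these toy stubs, the three questions land in three different buckets: one direct violation, one violation found only through a scenario, and one compliant result. In a real setup, each callable would be replaced by whatever question generator, model client, and judging procedure the tester actually uses.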

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20325/full.md

## Figures

26 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20325/full.md

---
Source: https://tomesphere.com/paper/2508.20325