GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

Peiyan Zhang; Haibo Jin; Liying Kang; Haohan Wang

arXiv:2507.07735 · cs.LG · July 11, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces GuardVal, a dynamic evaluation protocol for assessing LLM safety by generating and refining jailbreak prompts, revealing vulnerabilities across diverse models and safety domains.

Contribution

We propose GuardVal, a novel adaptive evaluation method that improves jailbreak testing of LLMs by dynamically generating prompts and preventing stagnation during refinement.

Findings

1. Different models exhibit unique vulnerability patterns
2. GuardVal uncovers deeper weaknesses in LLMs
3. Evaluation enhances understanding of LLM safety behaviors

Abstract

Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM's state, providing a more accurate assessment of defender LLMs' capacity to handle safety-critical situations. Moreover, we propose a new…

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 5

Strengths

1. The paper makes some useful contributions for subsequent work, such as a list of desiderata for jailbreak evaluation, and a dynamic jailbreak prompt generation approach, with a somewhat novel optimizer module to ensure no stagnation in generated attacks.
2. The optimizer design is also somewhat innovative, drawing inspiration from the Adam optimizer to maintain estimates of the change in responses over time, and converting these estimates to different sets of natural language prompts provided
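Reviewer 01's description of an Adam-inspired optimizer can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar `delta` (change in some response score between rounds), the class `StagnationTracker`, the function `pick_hint`, the threshold, and the hint labels are all hypothetical names chosen for illustration. Only the moment-estimate and bias-correction arithmetic follows the standard Adam update.

```python
from dataclasses import dataclass


@dataclass
class StagnationTracker:
    """Adam-style running estimates of how much responses change per round.

    Hypothetical helper: the source only says the optimizer keeps estimates
    of response change over time; the exact quantities are assumed here.
    """
    beta1: float = 0.9    # decay rate for the first-moment (mean) estimate
    beta2: float = 0.999  # decay rate for the second-moment estimate
    eps: float = 1e-8     # avoids division by zero
    m: float = 0.0
    v: float = 0.0
    t: int = 0

    def update(self, delta: float) -> float:
        """Feed one observed change and return a normalized trend signal."""
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * delta
        self.v = self.beta2 * self.v + (1 - self.beta2) * delta * delta
        # Bias correction, as in Adam, so early estimates are not pulled to 0.
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return m_hat / (v_hat ** 0.5 + self.eps)


def pick_hint(signal: float, threshold: float = 0.1) -> str:
    """Map the numeric signal to an abstract textual hint (assumed labels)."""
    if abs(signal) < threshold:
        return "explore"  # little net change: treat as stagnation
    return "refine" if signal > 0 else "revert"
```

In this sketch a signal near zero is read as stagnation and triggers a change of strategy, which is one plausible way to turn numeric estimates into distinct sets of natural language guidance.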

Weaknesses

1. In my opinion, presentation is one of the biggest issues holding this paper back. The background and OSV sections feel too verbose, and the jailbreak generation method is barely described in the main section of the paper.
2. There are other methodological concerns as well. The paper claims that the Optimizer results in a more diverse set of jailbreak attacks, but no analysis is presented confirming this hypothesis.
3. The jailbreak generation method is not compared to other methods in recent

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. Originality: The paper proposes a novel protocol that dynamically evaluates a group of LLMs on their capacity to defend against jailbreak attacks by asking the LLMs to generate jailbreak prompts for each other.
2. Quality: The method appears to perform well in the experiments, with results that match existing LLM jailbreak benchmarks.
3. Clarity: The paper is well written, and key information such as the prompts is provided in the appendix.
4. Significance: The paper addresses the limitation of traditional human-labor-

Weaknesses

1. The calculation of the Overall Safety Value seems somewhat arbitrary. Is Offensive Capability considered as important as Defensive Capability in the formula? Will LLMs with good offensive capability gain a larger advantage than expected? Is there any correlation or consistency between the defensive and offensive capabilities of LLMs? Analyzing the values and rankings under each capability separately may help empirically "justify the relationship between offensive and safety".

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The motivation is great and clear: The fixed nature of existing jailbreak datasets and benchmarks results in insufficient evaluation of consistently updated LLMs.
2. This paper is easy to follow.
3. Although not perfect, the metric Overall Safety Value is interesting and inspiring.

Weaknesses

1. The rationale behind the Overall Safety Value, especially the offensive capability. Although highlighted in the Discussion, this reviewer is concerned about the rationale behind the design of offensive capability. Specifically, according to Equation 1, we can conclude that if an LLM's attack capability is stronger, then its OSV will be worse. The authors should articulate a deeper rationale for such a design, e.g., why an LLM's ability to attack a target model can be correlated with its own

Reviewer 04 · Rating 3 · Confidence 4

Strengths

- The method is clearly presented but not very original, as other existing works have already implemented dynamic safety and jailbreak evaluation.
- The evaluation is robust and well-conducted, and the results presentation is clear and comprehensive.
- The prompt generation aligns with international standards, enhancing the method’s reliability and applicability across diverse contexts.

Weaknesses

- The introductory and background sections are overly detailed, occupying substantial space so that the method is not introduced until almost page 5. I'd condense these sections to around 2 pages to allow for a more detailed explanation of the core methodology.
- The Optimizer component, crucial in the evaluation pipeline, lacks sufficient explanation. I would better clarify its role and importance within the process.
- While the reason to combine attacking and defensive scores is explained, I don't underst

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Information and Cyber Security · Web Application Security Vulnerabilities