GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing
Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang

TL;DR
This paper introduces GuardVal, a dynamic evaluation protocol for assessing LLM safety by generating and refining jailbreak prompts, revealing vulnerabilities across diverse models and safety domains.
Contribution
We propose GuardVal, a novel adaptive evaluation method that improves jailbreak testing of LLMs by dynamically generating prompts and preventing stagnation during refinement.
Findings
Different models exhibit unique vulnerability patterns
GuardVal uncovers deeper weaknesses in LLMs
Evaluation enhances understanding of LLM safety behaviors
Abstract
Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM's state, providing a more accurate assessment of defender LLMs' capacity to handle safety-critical situations. Moreover, we propose a new…
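The abstract describes a protocol in which prompts are generated and refined round by round according to how the defender responds. The following is a minimal sketch of that control flow only, not the authors' implementation. The names `dynamic_eval`, `propose`, `defender`, `judge`, `max_rounds`, and `threshold` are all hypothetical, and the three callables stand in for model calls that the page does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvalResult:
    rounds_used: int                      # rounds run before stopping
    broken: bool                          # True if the judge score reached the threshold
    history: List[float] = field(default_factory=list)  # judge score per round


def dynamic_eval(
    propose: Callable[[List[float]], str],   # hypothetical: builds the next test prompt from past scores
    defender: Callable[[str], str],          # hypothetical: the model under evaluation
    judge: Callable[[str], float],           # hypothetical: scores a response in [0, 1]
    max_rounds: int = 5,
    threshold: float = 0.5,
) -> EvalResult:
    """Run up to max_rounds of propose -> respond -> judge, stopping early on failure."""
    history: List[float] = []
    for round_idx in range(1, max_rounds + 1):
        prompt = propose(history)        # refinement sees every earlier score
        response = defender(prompt)
        score = judge(response)
        history.append(score)
        if score >= threshold:           # defender failed this round
            return EvalResult(round_idx, True, history)
    return EvalResult(max_rounds, False, history)
```

Because each call to `propose` receives the full score history, the test set is not fixed in advance, which is the property the abstract contrasts with static benchmarks.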
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The paper makes some useful contributions for subsequent work, such as a list of desiderata for jailbreak evaluation, and a dynamic jailbreak prompt generation approach, with a somewhat novel optimizer module to ensure no stagnation in generated attacks. 2. The optimizer design is also somewhat innovative, drawing inspiration from the Adam optimizer to maintain estimates of the change in responses over time, and converting these estimates to different sets of natural language prompts provided
1. In my opinion, presentation is one of the biggest issues holding this paper back. The background and OSV sections feel too verbose, and the jailbreak generation method is barely described in the main section of the paper. 2. There are other methodological concerns as well. The paper claims that the Optimizer results in a more diverse set of jailbreak attacks, but no analysis is presented confirming this hypothesis. 3. The jailbreak generation method is not compared to other methods in recent
1. Originality: the paper proposes a novel protocol to dynamically evaluate a group of LLMs on their capacity to defend against jailbreak attacks by asking the LLMs to generate jailbreak prompts for each other. 2. Quality: The method seems to perform well in the experiments and produces good results that match existing LLM jailbreak benchmarks. 3. Clarity: The paper is well written, and key information such as the prompts is provided in the appendix. 4. Significance: The paper addresses the limitation of traditional human-labor-
1. The calculation of Overall Safety Value seems to be a bit arbitrary. Is Offensive Capability considered as important as Defensive Capability in the formula? Will LLMs with good offensive capability get advantages that are more than expected? Is there any correlation or consistency between defensive and offensive capability of LLMs? Analyzing values and ranking with either one capability respectively may help empirically "justify the relationship between offensive and safety".
1. The motivation is great and clear: The fixed nature of existing jailbreak datasets and benchmarks results in insufficient evaluation of consistently updated LLMs. 2. This paper is easy to follow. 3. Although not perfect, the metric Overall Safety Value is interesting and inspiring.
1. The rationale behind the overall safety value, especially the offensive capability. Although highlighted in Discussion, this reviewer is concerned about the rationale behind the design of offensive capability. Specifically, according to Equation 1, we can conclude that if an LLM's attack capability is stronger, then its OSV value will be worse. The authors should articulate a deeper rationale for such a design, e.g., why an LLM's ability to attack a target model can be correlated with its own
- The method is clearly presented but not very original, as other existing works have already implemented dynamic safety and jailbreak evaluation. - The evaluation is robust and well-conducted, and the results presentation is clear and comprehensive. - The prompt generation aligns with international standards, enhancing the method’s reliability and applicability across diverse contexts.
- The introductory and background sections are overly detailed, occupying so much space that the method is not introduced until almost page 5. I'd condense these sections to around 2 pages to allow for a more detailed explanation of the core methodology. - The Optimizer component, crucial in the evaluation pipeline, lacks sufficient explanation. I would clarify its role and importance within the process. - While the reason to combine attacking and defensive scores is explained, I don't underst
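Several reviews mention the Optimizer, which one reviewer describes as drawing on the Adam optimizer to track how responses change over time and to map those estimates to different natural-language guidance. The page does not give the actual update rule, so the following is only a sketch of what Adam-style moment tracking for stagnation detection could look like. The class `ProgressTracker`, the function `guidance`, the `stall` threshold, and the three guidance labels are all assumptions made for illustration.

```python
import math
from typing import Optional


class ProgressTracker:
    """Adam-style running estimates of the change in a per-round score."""

    def __init__(self, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = 0.0                      # first-moment estimate (mean change)
        self.v = 0.0                      # second-moment estimate (mean squared change)
        self.t = 0                        # number of updates so far
        self.prev: Optional[float] = None

    def update(self, score: float) -> float:
        # The first call has no earlier score, so its change is taken as 0.
        delta = 0.0 if self.prev is None else score - self.prev
        self.prev = score
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * delta
        self.v = self.beta2 * self.v + (1 - self.beta2) * delta * delta
        # Bias correction, as in Adam, so early estimates are not pulled toward 0.
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return m_hat / (math.sqrt(v_hat) + self.eps)


def guidance(signal: float, stall: float = 0.1) -> str:
    """Map the normalized trend to a hypothetical guidance label."""
    if abs(signal) < stall:
        return "diversify"               # little net change: treat as stagnation
    return "continue" if signal > 0 else "revert"
```

In this sketch a flat score sequence yields a signal near zero and therefore the "diversify" label, which is one way to express the "no stagnation" goal the reviews attribute to the Optimizer.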
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Information and Cyber Security · Web Application Security Vulnerabilities
