GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods

Ruixuan Huang; Xunguang Wang; Zongjie Li; Daoyuan Wu; Shuai Wang

arXiv:2502.16903·cs.CL·July 10, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces GuidedBench, a comprehensive benchmark with case-specific guidelines for evaluating LLM jailbreak methods, addressing evaluation discrepancies and improving measurement accuracy and consistency.

Contribution

It presents GuidedBench and GuidedEval, a new evaluation framework that enhances accuracy, reduces variance, and provides deeper insights into jailbreak effectiveness and evaluation practices.

Findings

01

GuidedBench improves measurement accuracy of jailbreak methods.

02

GuidedEval reduces inter-evaluator variance by over 76%.

03

Incorporating guidelines enhances jailbreak attack effectiveness.
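The page does not say how the 76% figure in finding 02 is computed. One plausible reading is a relative drop in the variance of scores that different evaluator LLMs assign to the same response. The sketch below illustrates that reading only; the function names and all score values are hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def mean_variance_across_evaluators(scores_by_case):
    """Average, over cases, of the variance among evaluator scores for that case."""
    return sum(pvariance(scores) for scores in scores_by_case) / len(scores_by_case)


def relative_reduction(baseline, guided):
    """Fraction by which `guided` is smaller than `baseline`."""
    return (baseline - guided) / baseline


# Hypothetical data: each inner list holds three evaluators' scores for one response.
baseline_scores = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
guided_scores = [[0.5, 0.5, 0.75], [0.25, 0.25, 0.25]]

baseline_var = mean_variance_across_evaluators(baseline_scores)
guided_var = mean_variance_across_evaluators(guided_scores)
reduction = relative_reduction(baseline_var, guided_var)
print(f"relative variance reduction: {reduction:.1%}")
```

Under this reading, a reduction above 0.76 would mean the evaluators disagree far less once they share the same case-specific guidelines.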

Abstract

Despite the growing interest in jailbreak methods as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. We conduct a systematic measurement study based on 37 jailbreak studies since 2022, focusing on both the methods and the evaluation systems they employ. We find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications. This paper advocates a shift to a more nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset, detailed case-by-case evaluation guidelines and an evaluation system integrated with these guidelines -- GuidedEval. Experiments demonstrate that GuidedBench…
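The case-by-case guidelines can be pictured as a per-question checklist. The reviews on this page describe an entity/action rubric in which all scoring points count equally, so a minimal sketch, under those assumptions, scores a response by the fraction of points it covers. The `Guideline` class, the `mentions` helper, and the placeholder scoring points are hypothetical. In particular, the substring check stands in for the LLM judge that an evaluation system like GuidedEval would use.

```python
from dataclasses import dataclass


@dataclass
class Guideline:
    """Hypothetical container for one question's scoring points."""
    entities: list[str]  # things a harmful answer would have to name
    actions: list[str]   # steps a harmful answer would have to describe


def mentions(response: str, point: str) -> bool:
    # Stand-in for an LLM judge: case-insensitive substring match.
    return point.lower() in response.lower()


def guided_score(response: str, guideline: Guideline) -> float:
    """Fraction of scoring points covered, with every point weighted equally."""
    points = guideline.entities + guideline.actions
    if not points:
        raise ValueError("guideline has no scoring points")
    hits = sum(mentions(response, point) for point in points)
    return hits / len(points)


# Benign placeholder guideline and responses, for illustration only.
guideline = Guideline(entities=["widget a", "widget b"], actions=["combine", "heat"])
refusal = "I'm sorry, but I can't help with that."
partial = "First obtain widget A, then combine it with the other part."

print(guided_score(refusal, guideline))
print(guided_score(partial, guideline))
```

A graded score of this kind separates a refusal from a partially helpful answer, which a binary keyword match on refusal phrases cannot do.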

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

(1) Clear motivation and problem identification. The paper isolates a key weakness in current jailbreak research: inflated or unstable ASR estimates caused by loosely defined evaluation metrics. It argues persuasively that improving the precision and reproducibility of evaluation is essential to progress in this area. The problem framing is clear and well justified. (2) Rigorous, interpretable, and actionability-based scoring design. The entity/action rubric formalizes jailbreak success in

Weaknesses

A recurring theme in the paper’s claims is the improved stability and reliability of GuidedEval across multiple axes of variation. These axes can be grouped into three distinct dimensions: (A) stability across victim LLMs and harmful-topic categories, (B) consistency across evaluation schemes and their resulting method rankings, and (C) agreement across evaluator LLMs used to judge the same responses. Each dimension captures a different aspect of robustness: (A) tests whether the measured effect

Reviewer 02 · Rating 8 · Confidence 4

Strengths

This is an important problem: current jailbreak methods produce wildly inconsistent results, often with a large number of attempts. Keyword-based methods can even reverse rankings of attack efficiency. The paper is quite comprehensive, evaluates a large number of cases across 10 attack methods, and gives thorough evidence. There are concrete entities and actions to actually help attacks, which makes it more objective as well as interpretable, and the 76% reduction is quite important. The agre

Weaknesses

Assumes all scoring points are equally important, which is somewhat unrealistic in practice: some entities are clearly more critical for harm. The dataset size is fairly small, with no strong analysis of that. There is no evaluation on completely novel harmful question types outside the predefined categories. The dataset filtering could introduce selection bias and miss important information (the authors admit guidelines can miss non-essential but relevant details). Evaluators are “less safety-res

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The data and system are effective in assessing jailbreak capabilities. 2. The method addresses a gap in the current evaluation of jailbreak attacks. 3. Analysis and results aim to provide a more accurate view of the field compared to other methods.

Weaknesses

1. Many parts of the paper are lacking in description or rely on the appendix, making them hard to follow. For example, there is no explanation of the experimental setup for the "Misjudged Cases" part. 2. The writing makes it hard to understand how the conclusions are drawn from the raw results. For example, after Tables 2 and 3, there are very general takeaways that do not refer to the table and the results. 3. This claim: "Our leaderboard analysis reveals that prior evaluation systems, especia

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning

Methods: Softmax · Attention Is All You Need