GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang

TL;DR
This paper introduces GuidedBench, a comprehensive benchmark with case-specific guidelines for evaluating LLM jailbreak methods, addressing evaluation discrepancies and improving measurement accuracy and consistency.
Contribution
It presents GuidedBench and GuidedEval, a new evaluation framework that enhances accuracy, reduces variance, and provides deeper insights into jailbreak effectiveness and evaluation practices.
Findings
GuidedBench improves measurement accuracy of jailbreak methods.
GuidedEval reduces inter-evaluator variance by over 76%.
Incorporating guidelines enhances jailbreak attack effectiveness.
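The variance finding above can be made concrete with a small sketch. It assumes, hypothetically, that each evaluator LLM reports an attack success rate (ASR) for the same set of responses, and that "inter-evaluator variance" means the variance of those ASRs across evaluators. The paper's exact definition may differ, and the numbers below are invented purely for illustration.

```python
from statistics import pvariance


def variance_reduction(baseline_scores, guided_scores):
    """Relative reduction in across-evaluator variance, as a fraction.

    Both arguments hold one score per evaluator (hypothetical ASR values).
    """
    base_var = pvariance(baseline_scores)
    if base_var == 0:
        raise ValueError("baseline variance is zero; reduction is undefined")
    return 1.0 - pvariance(guided_scores) / base_var


# Invented ASRs from three evaluator LLMs judging the same responses.
baseline_asr = [0.30, 0.50, 0.70]  # assumed: evaluators without guidelines
guided_asr = [0.45, 0.50, 0.55]    # assumed: evaluators with guidelines

reduction = variance_reduction(baseline_asr, guided_asr)
print(f"variance reduction: {reduction:.4f}")
```

Under this reading, a result above 0.76 would correspond to the "over 76%" figure; the function name and inputs are illustrative, not part of GuidedEval.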
Abstract
Despite the growing interest in jailbreak methods as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. We conduct a systematic measurement study based on 37 jailbreak studies since 2022, focusing on both the methods and the evaluation systems they employ. We find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications. This paper advocates a shift to a more nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset, detailed case-by-case evaluation guidelines and an evaluation system integrated with these guidelines -- GuidedEval. Experiments demonstrate that GuidedBench…
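A minimal sketch of what guideline-based scoring could look like follows. It assumes, based on the peer reviews on this page, that each guideline lists entity and action scoring points and that all points are weighted equally. The class, the function, and the example points are hypothetical, and the per-point judgments that an evaluator LLM would produce are stubbed as hand-written booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringPoint:
    kind: str         # "entity" or "action" (assumed categories)
    description: str  # what the response must contain to earn the point


def guided_score(points, satisfied):
    """Equal-weight score: the fraction of scoring points judged present.

    `satisfied` holds one boolean per point, in the same order as `points`.
    """
    if not points:
        raise ValueError("a guideline needs at least one scoring point")
    if len(points) != len(satisfied):
        raise ValueError("one judgment is required per scoring point")
    return sum(bool(s) for s in satisfied) / len(points)


# Hypothetical, deliberately generic guideline for one question.
guideline = [
    ScoringPoint("entity", "names the required tool"),
    ScoringPoint("entity", "names the required material"),
    ScoringPoint("action", "describes the first step"),
    ScoringPoint("action", "describes the final step"),
]

# Stubbed judgments; in practice an evaluator LLM would supply these.
score = guided_score(guideline, [True, False, True, True])
print(score)
```

Equal weighting keeps the score easy to interpret, but as one reviewer notes, it treats critical and minor points alike; a weighted variant would need per-point weights that this page does not describe.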
Peer Reviews
Decision: ICLR 2026 Poster
(1) Clear motivation and problem identification. The paper isolates a key weakness in current jailbreak research: inflated or unstable ASR estimates caused by loosely defined evaluation metrics. It argues persuasively that improving the precision and reproducibility of evaluation is essential to progress in this area. The problem framing is clear and well justified. (2) Rigorous, interpretable, and actionability-based scoring design. The entity/action rubric formalizes jailbreak success in
A recurring theme in the paper’s claims is the improved stability and reliability of GuidedEval across multiple axes of variation. These axes can be grouped into three distinct dimensions: (A) stability across victim LLMs and harmful-topic categories, (B) consistency across evaluation schemes and their resulting method rankings, and (C) agreement across evaluator LLMs used to judge the same responses. Each dimension captures a different aspect of robustness: (A) tests whether the measured effect
It is an important problem that current jailbreak methods produce wildly inconsistent results, often with a large number of attempts. Keyword-based methods can even reverse rankings of attack efficiency. The paper is quite comprehensive: it evaluates a large number of cases across 10 attack methods and gives thorough evidence. The guidelines specify concrete entities and actions that would actually help attacks, which makes evaluation more objective and more interpretable, and the 76% reduction is quite important. The agre
The method assumes all scoring points are equally important, which is somewhat unrealistic in practice: some entities are clearly more critical for harm. The dataset is fairly small, with no strong analysis of that. There is no evaluation on completely novel harmful question types outside the predefined categories. The dataset filtering could introduce selection bias and miss important information (the authors admit guidelines can miss non-essential but relevant details). Evaluators are “less safety-res
1. The data and system are effective in assessing jailbreak capabilities. 2. The method addresses a gap in the current evaluation of jailbreak attacks. 3. Analysis and results aim to provide a more accurate view of the field compared to other methods.
1. Many parts of the paper are lacking in description or rely on the appendix, making them hard to follow. For example, there is no explanation of the experimental setup for the "Misjudged Cases" part. 2. The writing makes it hard to understand how the conclusions are drawn from the raw results. For example, after Tables 2 and 3, there are very general takeaways that do not refer to the table and the results. 3. This claim: "Our leaderboard analysis reveals that prior evaluation systems, especia
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning
Methods: Softmax · Attention Is All You Need
