SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation
Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan

TL;DR
SceneJailEval introduces a scenario-adaptive, multi-dimensional framework for evaluating LLM jailbreaks, addressing limitations of existing methods by providing accurate, scenario-specific assessments and a comprehensive dataset.
Contribution
It proposes a novel, adaptable evaluation framework and a rich dataset, significantly improving jailbreak assessment accuracy across diverse scenarios.
Findings
Achieved an F1 score of 0.917 on the full-scenario dataset
Outperformed state-of-the-art methods by 6% in F1 score
Demonstrated robustness and extensibility across multiple scenarios
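For readers unfamiliar with the metric, the F1 figures above can be read through a minimal sketch of how a binary F1 score is computed. The summary does not say how the score was averaged, so this assumes the plain binary case with "jailbreak succeeded" as the positive class. The labels and the name `f1_score` are made up for illustration and are not taken from the paper.

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: 1 = response judged a successful jailbreak, 0 = not.
human = [1, 1, 1, 0, 0, 1, 0, 1]  # assumed ground-truth annotations
judge = [1, 1, 0, 0, 1, 1, 0, 1]  # assumed evaluator outputs

print(f"{f1_score(human, judge):.3f}")  # prints 0.800
```

Here precision and recall are both 4/5, so the F1 score is 0.8. An evaluator scoring 0.917 under this definition would be balancing the two at a noticeably higher level.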
Abstract
Accurate jailbreak evaluation is critical for LLM red team testing and jailbreak research. Mainstream methods rely on binary classification (string matching, toxic text classifiers, and LLM-based methods), outputting only "yes/no" labels without quantifying harm severity. Emerging multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness and Informativeness) apply unified evaluation standards across scenarios, which leads to scenario-specific mismatches (e.g., "Relative Truthfulness" is irrelevant to "hate speech") and undermines evaluation accuracy. To address these limitations, we propose SceneJailEval, with key contributions: (1) A pioneering scenario-adaptive multi-dimensional framework for jailbreak evaluation, overcoming the critical "one-size-fits-all" limitation of existing multi-dimensional methods, and boasting robust extensibility to seamlessly adapt to customized or emerging…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics
