JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework

Fan Liu; Yue Feng; Zhao Xu; Lixin Su; Xinyu Ma; Dawei Yin; Hao Liu

arXiv:2410.12855·cs.CL·October 21, 2024·3 cites

TL;DR

JAILJUDGE introduces a comprehensive benchmark and multi-agent evaluation framework for assessing LLM safety against jailbreak attacks, emphasizing explainability, multilingual scenarios, and robust defense mechanisms.

Contribution

The paper presents JAILJUDGE, a new benchmark with diverse risk scenarios and a multi-agent framework for explainable evaluation, along with novel attack and defense tools.

Findings

1. State-of-the-art performance of JailJudge methods across models and scenarios.
2. JailBoost improves attack performance by 29.24%.
3. GuardShield reduces defense ASR from 40.46% to 0.15%.
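
The ASR figures above are attack success rates: the share of attack attempts that a judge labels as successful jailbreaks. A minimal sketch of that calculation follows, assuming the judge output has already been reduced to one boolean per attempt. The function name `attack_success_rate` and the verdict lists are hypothetical illustrations, not the paper's code or data.

```python
from typing import Sequence


def attack_success_rate(jailbroken: Sequence[bool]) -> float:
    """Return the percentage of attack attempts judged as successful jailbreaks."""
    if not jailbroken:
        raise ValueError("need at least one judged attempt")
    return 100.0 * sum(jailbroken) / len(jailbroken)


# Hypothetical judge verdicts for the same 8 attack prompts,
# first against an undefended model, then with a defense in place.
undefended = [True, False, True, True, False, False, True, False]
defended = [False, False, False, True, False, False, False, False]

asr_before = attack_success_rate(undefended)
asr_after = attack_success_rate(defended)
print(f"ASR without defense: {asr_before:.2f}%")
print(f"ASR with defense:    {asr_after:.2f}%")
```

A lower ASR after adding a defense means fewer attacks got through, so the number is only as reliable as the judge that produces the per-attempt verdicts.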

Abstract

Despite advancements in enhancing LLM safety against jailbreak attacks, evaluating LLM defenses remains a challenge: current methods often lack explainability and fail to generalize to complex scenarios, leading to incomplete assessments (e.g., direct judgment without reasoning, low F1 score of GPT-4 in complex cases, bias in multilingual scenarios). To address this, we present JAILJUDGE, a comprehensive benchmark featuring diverse risk scenarios, including synthetic, adversarial, in-the-wild, and multilingual prompts, along with high-quality human-annotated datasets. The JAILJUDGE dataset includes over 35k instruction-tuning data points with reasoning explainability, and JAILJUDGETEST, which comprises a 4.5k+ labeled set for risk scenarios and a 6k+ multilingual set across ten languages. To enhance evaluation with explicit reasoning, we propose the JailJudge MultiAgent framework, which enables…
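
The abstract measures judge quality by F1 score against human annotations. A minimal sketch of that metric for a binary jailbreak judge follows, assuming each sample carries one human label and one judge verdict (True meaning "jailbroken"). The function name and the example labels are hypothetical and are not taken from the JAILJUDGE dataset.

```python
from typing import Sequence


def judge_f1(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """F1 of judge verdicts against human labels, with 'jailbroken' as the positive class."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    tp = sum(h and j for h, j in zip(human, judge))        # both say jailbroken
    fp = sum((not h) and j for h, j in zip(human, judge))  # judge over-flags
    fn = sum(h and (not j) for h, j in zip(human, judge))  # judge misses a jailbreak
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six responses.
human_labels = [True, True, True, False, False, False]
judge_labels = [True, True, False, True, False, False]

print(f"Judge F1: {judge_f1(human_labels, judge_labels):.3f}")
```

F1 balances missed jailbreaks against false alarms, which matters when jailbroken responses are a minority of the test set and plain accuracy would look high for a judge that flags nothing.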

Code & Models

Repositories

usail-hkust/Jailjudge
pytorch

Models

usail-hkust/JailJudge-guard
model · 315 dl · ♡ 4

Datasets

usail-hkust/JailJudge
dataset · 46 dl

Taxonomy

Topics: Artificial Intelligence in Law · Multi-Agent Systems and Negotiation · Digital and Cyber Forensics

Methods: Attention Is All You Need · Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Linear Layer · Softmax · Multi-Head Attention · Dropout