JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models

Shuyi Liu; Simiao Cui; Haoran Bu; Yuming Shang; Xi Zhang

arXiv:2502.18935·cs.CL·February 27, 2025

TL;DR

JailBench is a comprehensive Chinese benchmark for evaluating and exposing deep-seated safety vulnerabilities in large language models. It is built with an Automatic Jailbreak Prompt Engineer (AJPE) framework that incorporates jailbreak techniques into benchmark construction.

Contribution

The paper introduces JailBench, the first detailed Chinese-specific safety assessment benchmark for LLMs, with a novel framework for automatic dataset scaling and vulnerability detection.

Findings

01 Achieves the highest attack success rate against ChatGPT among existing Chinese benchmarks.

02 Effectively exposes latent vulnerabilities in 13 mainstream LLMs.

03 Demonstrates substantial room for improving LLM safety in Chinese-language applications.
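
The attack success rate mentioned in the first finding is commonly computed as the fraction of benchmark prompts for which the model's response is judged unsafe. This page does not state the paper's exact definition or which judge it uses, so the sketch below is only an illustration under that assumption; the function name `attack_success_rate` and the boolean judgement format are hypothetical.

```python
from typing import Iterable


def attack_success_rate(judgements: Iterable[bool]) -> float:
    """Return the fraction of prompts whose response was judged unsafe.

    Assumption: each element is True when a judge (human or automated)
    labelled the model's response to one benchmark prompt as unsafe.
    """
    results = list(judgements)
    if not results:
        raise ValueError("at least one judgement is required")
    # True counts as 1 and False as 0, so the sum is the number of successes.
    return sum(results) / len(results)


if __name__ == "__main__":
    # Illustrative labels only, not data from the paper.
    example = [True, False, True, True]
    print(f"ASR = {attack_success_rate(example):.2f}")
```

A higher value means that more prompts elicited unsafe output, which is the sense in which a benchmark "exposes" vulnerabilities.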

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations. In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment. However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities. To address the gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context. To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance…

Code & Models

Models

hfuserh/LLaMA-3.1-8B-JailbreakSafe (Hugging Face model)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Information and Cyber Security