JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
Shuyi Liu, Simiao Cui, Haoran Bu, Yuming Shang, Xi Zhang

TL;DR
JailBench is a comprehensive Chinese benchmark for exposing deep-seated safety vulnerabilities in large language models. It is constructed with the Automatic Jailbreak Prompt Engineer (AJPE) framework, which incorporates jailbreak techniques into prompt generation.
Contribution
The paper introduces JailBench, the first detailed Chinese-specific safety assessment benchmark for LLMs, with a novel framework for automatic dataset scaling and vulnerability detection.
Findings
Achieves highest attack success rate against ChatGPT among Chinese benchmarks.
Effectively exposes latent vulnerabilities in 13 mainstream LLMs.
Demonstrates substantial room for improving LLM safety in Chinese language applications.
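The attack success rate cited in the findings can be illustrated with a minimal sketch. It assumes the common definition (the fraction of benchmark prompts for which the model's response is judged unsafe); the paper's exact judging procedure is not given in this summary, and the function name and sample judgments below are hypothetical.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Return the fraction of prompts whose response was judged unsafe.

    Assumption: one boolean per benchmark prompt, True meaning the
    model produced an unsafe response. This is a generic definition,
    not necessarily the one used by the JailBench authors.
    """
    if not unsafe_flags:
        raise ValueError("need at least one judged response")
    return sum(unsafe_flags) / len(unsafe_flags)


# Hypothetical judgments for five prompts (illustrative data only).
judgments = [True, False, True, True, False]
print(f"ASR = {attack_success_rate(judgments):.0%}")
```

Under this definition, a higher rate means the benchmark's prompts elicit unsafe output more often, which is the sense in which a benchmark "exposes" vulnerabilities.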
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations. In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment. However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities. To address the gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context. To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Information and Cyber Security
