Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Zhao Xu, Fan Liu, Hao Liu

TL;DR
This paper introduces JailTrickBench, a comprehensive benchmarking framework for evaluating jailbreak attacks on LLMs, emphasizing the importance of standardized assessment across various attack settings and defense methods.
Contribution
The paper presents JailTrickBench, a new benchmark for systematically evaluating jailbreak attacks on LLMs, including diverse attack factors and defense scenarios, with extensive experimental validation.
Findings
Standardized benchmarking reveals vulnerabilities in defense-enhanced LLMs.
Evaluation of eight key attack factors across multiple datasets and defenses.
Approximately 354 experiments demonstrate the framework's effectiveness.
Abstract
Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work largely overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we introduce JailTrickBench to evaluate the impact of various attack settings on LLM performance and provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and…
Taxonomy
Topics: Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies
