"Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks
Libo Wang

TL;DR
This paper evaluates the robustness of guardrails in several large language models against multi-step jailbreak prompts designed to induce harmful content, revealing vulnerabilities and highlighting Claude 3.5 Sonnet's relative resistance.
Contribution
It introduces a black-box testing framework for assessing guardrail effectiveness in LLMs using multi-step ethical attack prompts, and provides empirical results on model vulnerabilities.
Findings
Most models' guardrails were bypassed by multi-step prompts.
Claude 3.5 Sonnet showed stronger resistance to jailbreak prompts.
The study provides open-source tools for further testing.
Abstract
As the application of large language models continues to expand in various fields, it poses higher challenges to the effectiveness of identifying harmful content generation and guardrail mechanisms. This research aims to evaluate the guardrail effectiveness of GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5, and Claude 3.5 Sonnet through black-box testing of seemingly ethical multi-step jailbreak prompts. It conducts ethical attacks by designing an identical multi-step prompts that simulates the scenario of "corporate middle managers competing for promotions." The data results show that the guardrails of the above-mentioned LLMs were bypassed and the content of verbal attacks was generated. Claude 3.5 Sonnet's resistance to multi-step jailbreak prompts is more obvious. To ensure objectivity, the experimental process, black box test code, and enhanced guardrail code are uploaded to the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHate Speech and Cyberbullying Detection
MethodsLLaMA
