"Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks

Libo Wang

arXiv:2411.16730·cs.CR·March 21, 2025

TL;DR

This paper evaluates the guardrails of five large language models (GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5, and Claude 3.5 Sonnet) against seemingly ethical multi-step jailbreak prompts designed to elicit verbal attacks. The prompts bypassed the guardrails, with Claude 3.5 Sonnet showing comparatively stronger resistance.

Contribution

It introduces a black-box testing procedure for assessing guardrail effectiveness in LLMs using seemingly ethical ("moralized") multi-step jailbreak prompts, and reports empirical results on how each tested model responded.

Findings

01

The multi-step prompts bypassed the tested models' guardrails and elicited verbal-attack content.

02

Claude 3.5 Sonnet showed stronger resistance to the jailbreak prompts than the other models tested.

03

The experimental process, black-box test code, and enhanced guardrail code are released for further testing.

Abstract

As large language models are applied in more and more fields, identifying harmful generated content and keeping guardrail mechanisms effective become more challenging. This research evaluates the guardrail effectiveness of GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5, and Claude 3.5 Sonnet through black-box testing with seemingly ethical multi-step jailbreak prompts. It conducts these attacks with an identical set of multi-step prompts that simulates the scenario of "corporate middle managers competing for promotions." The results show that the guardrails of the above-mentioned LLMs were bypassed and verbal-attack content was generated, although Claude 3.5 Sonnet showed more evident resistance to the multi-step jailbreak prompts. To ensure objectivity, the experimental process, black box test code, and enhanced guardrail code are uploaded to the…

Code & Models

Repositories

brucewang123456789/GeniusTrail
Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: LLaMA