FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)

Aman Priyanshu; Supriti Vijay

arXiv:2408.16163·cs.CL·November 8, 2024

TL;DR

This paper presents FRACTURED-SORRY-Bench, a framework for testing LLM safety against multi-turn conversational attacks. Its adversarial prompts are generated by breaking harmful queries into seemingly innocuous sub-questions, which reveals vulnerabilities in current defenses.

Contribution

It introduces a new framework and method for generating adversarial prompts that expose weaknesses in LLM safety measures against multi-turn attacks.

Findings

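01 Attack success rates (ASRs) increased by up to +46.22% over baseline methods across GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo

02 The method poses a challenge to current LLM safety measures

03 The results highlight the need for more robust defenses against subtle, multi-turn attacks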

Abstract

This paper introduces FRACTURED-SORRY-Bench, a framework for evaluating the safety of Large Language Models (LLMs) against multi-turn conversational attacks. Building upon the SORRY-Bench dataset, we propose a simple yet effective method for generating adversarial prompts by breaking down harmful queries into seemingly innocuous sub-questions. Our approach achieves a maximum increase of +46.22% in Attack Success Rates (ASRs) across GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo models compared to baseline methods. We demonstrate that this technique poses a challenge to current LLM safety measures and highlights the need for more robust defenses against subtle, multi-turn attacks.
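The ASR comparison in the abstract can be illustrated with a minimal sketch. It assumes that an ASR is the percentage of prompts judged as successful attacks, and that the reported increase is an absolute difference in percentage points; neither definition is spelled out on this page. The function names and the toy judgments are hypothetical and are not data from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of attempts judged successful (assumed ASR definition)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def asr_increase(baseline: Sequence[bool], attacked: Sequence[bool]) -> float:
    """Return the ASR difference in percentage points (assumed meaning of the '+' figure)."""
    return attack_success_rate(attacked) - attack_success_rate(baseline)


# Made-up judgments for four prompts; True means the judge marked the attack successful.
baseline_outcomes = [False, False, True, False]
multi_turn_outcomes = [True, False, True, True]

print(f"baseline ASR:   {attack_success_rate(baseline_outcomes):.2f}%")
print(f"multi-turn ASR: {attack_success_rate(multi_turn_outcomes):.2f}%")
print(f"increase:       {asr_increase(baseline_outcomes, multi_turn_outcomes):+.2f} points")
```

If the paper instead reports a relative increase, the subtraction in `asr_increase` would need to be replaced by a ratio; the sketch does not settle which reading is correct.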

Code & Models

Datasets

AmanPriyanshu/FRACTURED-SORRY-Bench-Automated-Multishot-Jailbreak
dataset · 11 dl

Taxonomy

Topics: Information and Cyber Security · Advanced Malware Detection Techniques · User Authentication and Security Systems

Methods: Attention Is All You Need · Adam · Layer Normalization · Weight Decay · Position-Wise Feed-Forward Layer · Dense Connections · Attention Dropout · Cosine Annealing