"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak

Lingrui Mei; Shenghua Liu; Yiwei Wang; Baolong Bi; Jiayi Mao; Xueqi Cheng

arXiv:2406.11668·cs.CL·February 4, 2025

TL;DR

This paper argues that many LLM outputs flagged as jailbreaks may actually be hallucinations mistaken for safety breaches, and introduces BabyBLUE, a benchmark with evaluators and datasets to better assess genuine risks of harmful outputs.

Contribution

The paper introduces BabyBLUE, a specialized benchmark with evaluators and datasets to distinguish true jailbreak threats from hallucinations in LLM safety evaluations.

Findings

01. Many identified jailbreaks are hallucinations, not genuine threats.

02. BabyBLUE improves evaluation accuracy for LLM safety.

03. Enhanced benchmarks help develop better mitigation strategies.

Abstract

"Jailbreak" is a major safety concern for Large Language Models (LLMs): it occurs when malicious prompts lead an LLM to produce harmful outputs, raising questions about the reliability and safety of LLMs. Effective evaluation of jailbreaks is therefore crucial for developing mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations, that is, erroneous outputs mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the Benchmark for reliABilitY and jailBreak haLlUcination Evaluation (BabyBLUE). BabyBLUE introduces a specialized validation framework including…

Code & Models

Repositories

Meirtz/BabyBLUE-llm (official)

Taxonomy

Topics: Artificial Intelligence in Education