"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Jiayi Mao, Xueqi Cheng

TL;DR
This paper highlights that many LLM jailbreaks are hallucinations mistaken for safety breaches, and introduces BabyBLUE, a benchmark with evaluators and datasets to better assess genuine risks of harmful outputs.
Contribution
The paper introduces BabyBLUE, a specialized benchmark with evaluators and datasets to distinguish true jailbreak threats from hallucinations in LLM safety evaluations.
Findings
Many identified jailbreaks are hallucinations, not genuine threats.
BabyBLUE improves evaluation accuracy for LLM safety.
Enhanced benchmarks help develop better mitigation strategies.
Abstract
"Jailbreak" is a major safety concern of Large Language Models (LLMs), which occurs when malicious prompts lead LLMs to produce harmful outputs, raising issues about the reliability and safety of LLMs. Therefore, an effective evaluation of jailbreaks is very crucial to develop its mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations-erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the enchmark for reliilit and jailreak halcination valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework including…
Taxonomy
Topics: Artificial Intelligence in Education
