JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets
Zhihua Jin, Shiyi Liu, Haotian Li, Xun Zhao, and Huamin Qu

TL;DR
JailbreakHunter is a visual analytics system designed to detect private jailbreak prompts in large-scale human-LLM conversation datasets, helping to uncover evolving adversarial prompts that bypass safety measures.
Contribution
The paper introduces JailbreakHunter, a novel multi-level visual analytics workflow for identifying jailbreak prompts in extensive conversational data, addressing challenges of scale and diversity.
Findings
Effective in identifying private jailbreak prompts
Assists in understanding jailbreak strategies within conversations
Validated through case studies and expert feedback
Abstract
Large Language Models (LLMs) have gained significant attention but also raised concerns due to the risk of misuse. Jailbreak prompts, a popular type of adversarial attack against LLMs, have emerged and continue to evolve to breach the safety protocols of LLMs. To address this issue, LLMs are regularly updated with safety patches based on reported jailbreak prompts. However, malicious users often keep their successful jailbreak prompts private to exploit LLMs. To uncover these private jailbreak prompts, extensive analysis of large-scale conversational datasets is necessary to identify prompts that still manage to bypass the system's defenses. This task is highly challenging due to the immense volume of conversation data, the diverse characteristics of jailbreak prompts, and their presence in complex multi-turn conversations. To tackle these challenges, we introduce JailbreakHunter, a visual…
Taxonomy
Topics: Crime Patterns and Interventions · Digital and Cyber Forensics · Crime, Deviance, and Social Control
Methods: Softmax · Attention Is All You Need · Visual Analytics
