Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
Erick Galinkin, Martin Sablotny

TL;DR
This paper introduces a new method for detecting jailbreak prompts in large language models by leveraging pretrained text embeddings and machine learning, significantly improving detection accuracy over existing open-source solutions.
Contribution
The paper presents a novel approach combining pretrained embeddings with classification algorithms to enhance jailbreak prompt detection in LLMs.
Findings
Outperforms existing open-source jailbreak detection methods
Effective pairing of text embeddings with classifiers improves accuracy
Provides a scalable solution for LLM security enhancement
Abstract
The adoption of large language models (LLMs) in many applications, from customer service chat bots and software development assistants to more capable agentic systems necessitates research into how to secure these systems. Attacks like prompt injection and jailbreaking attempt to elicit responses and actions from these models that are not compliant with the safety, privacy, or content policies of organizations using the model in their application. In order to counter abuse of LLMs for generating potentially harmful replies or taking undesirable actions, LLM owners must apply safeguards during training and integrate additional tools to block the LLM from generating text that abuses the model. Jailbreaking prompts play a vital role in convincing an LLM to generate potentially harmful content, making it important to identify jailbreaking attempts to block any further steps. In this work,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDigital and Cyber Forensics · Sexual Assault and Victimization Studies
Methodstravel james
