Improved Large Language Model Jailbreak Detection via Pretrained   Embeddings

Erick Galinkin; Martin Sablotny

arXiv:2412.01547·cs.CR·December 3, 2024

Improved Large Language Model Jailbreak Detection via Pretrained Embeddings

Erick Galinkin, Martin Sablotny

PDF

Open Access 1 Models

TL;DR

This paper introduces a new method for detecting jailbreak prompts in large language models by leveraging pretrained text embeddings and machine learning, significantly improving detection accuracy over existing open-source solutions.

Contribution

The paper presents a novel approach combining pretrained embeddings with classification algorithms to enhance jailbreak prompt detection in LLMs.

Findings

01

Outperforms existing open-source jailbreak detection methods

02

Effective pairing of text embeddings with classifiers improves accuracy

03

Provides a scalable solution for LLM security enhancement

Abstract

The adoption of large language models (LLMs) in many applications, from customer service chat bots and software development assistants to more capable agentic systems necessitates research into how to secure these systems. Attacks like prompt injection and jailbreaking attempt to elicit responses and actions from these models that are not compliant with the safety, privacy, or content policies of organizations using the model in their application. In order to counter abuse of LLMs for generating potentially harmful replies or taking undesirable actions, LLM owners must apply safeguards during training and integrate additional tools to block the LLM from generating text that abuses the model. Jailbreaking prompts play a vital role in convincing an LLM to generate potentially harmful content, making it important to identify jailbreaking attempts to block any further steps. In this work,…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

🤗
nvidia/NemoGuard-JailbreakDetect
model· 69 dl· ♡ 18
69 dl♡ 18

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDigital and Cyber Forensics · Sexual Assault and Victimization Studies

Methodstravel james