Single-pass Detection of Jailbreaking Input in Large Language Models

Leyla Naz Candogan; Yongtao Wu; Elias Abad Rocamora; Grigorios G. Chrysos; Volkan Cevher

arXiv:2502.15435 · cs.LG · February 24, 2025

TL;DR

This paper introduces Single Pass Detection (SPD), a method for efficiently identifying jailbreaking inputs in large language models during a single forward pass, enhancing security without heavy computational costs.

Contribution

The paper presents SPD, a novel single-pass detection technique that uses the model's logits to identify harmful inputs. It is effective on open-source models and remains effective on the proprietary models GPT-3.5 and GPT-4, where only partial logit access is available.

Findings

01. SPD effectively detects jailbreaking attacks on open-source models.

02. SPD minimizes false positives on harmless inputs.

03. SPD remains effective with partial logit access in GPT-3.5 and GPT-4.
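
The partial-logit-access setting can be pictured with a small sketch. It assumes a hypothetical input format in which only the top few log-probabilities per generated position are available, and it pads the missing entries so that every input yields a vector of the same length. The function name `topk_logprob_features`, the padding value, and the example numbers are illustrative assumptions, not details taken from the paper.

```python
def topk_logprob_features(positions, k=5, num_positions=3, pad_value=-100.0):
    """Build a fixed-length feature vector from per-position top-k log-probabilities.

    positions: list of dicts mapping token string -> log-probability, one dict
               per generated position (hypothetical input format).
    """
    features = []
    for i in range(num_positions):
        # Positions the model never produced are treated as empty.
        entry = positions[i] if i < len(positions) else {}
        # Keep the k largest log-probabilities, in descending order.
        values = sorted(entry.values(), reverse=True)[:k]
        # Pad so that every position contributes exactly k numbers.
        values += [pad_value] * (k - len(values))
        features.extend(values)
    return features


# Made-up values, for illustration only.
example = [
    {"Sure": -0.1, "I": -2.5, "Here": -4.0},
    {",": -0.3, "!": -1.6},
]
vec = topk_logprob_features(example, k=3, num_positions=3)
```

A vector like `vec` could then be passed to whatever classifier is used in the full-logit setting; how the paper actually handles the missing logits is not stated in this summary.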

Abstract

Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
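
As a rough illustration of the idea in the abstract, the sketch below trains a small classifier on features derived from the logits of a single forward pass. Everything concrete here is an assumption made for illustration: the feature choice (sorted top-k logits of the first few output positions), the logistic-regression classifier, the helper names `logit_features`, `train_logistic`, and `predict`, and the synthetic data all stand in for details the abstract does not give. It is not the authors' implementation.

```python
import numpy as np


def logit_features(logits, num_positions=3, top_k=5):
    """Flatten the top_k largest logits of the first num_positions positions.

    logits: array of shape (positions, vocab_size) from ONE forward pass.
    """
    first = logits[:num_positions]
    # Sort each row in descending order and keep the top_k values.
    top = -np.sort(-first, axis=1)[:, :top_k]
    return top.reshape(-1)


def train_logistic(X, y, lr=0.1, steps=500):
    """Gradient-descent logistic regression (hypothetical stand-in classifier)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict(w, b, X):
    """Return 1 for inputs flagged as jailbreaking, 0 otherwise."""
    return (X @ w + b > 0).astype(int)


# Synthetic stand-in for real model logits: the "attack" class is simply
# shifted upward. Real logits would come from the defended LLM.
rng = np.random.default_rng(0)
shifts = [0.0] * 100 + [1.0] * 100
X = np.stack([
    logit_features(rng.normal(loc=s, scale=1.0, size=(4, 50))) for s in shifts
])
y = np.array([0] * 100 + [1] * 100)

# Standardize features so plain gradient descent is well behaved.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma

w, b = train_logistic(Xs, y)
accuracy = (predict(w, b, Xs) == y).mean()
```

The point of the design is that the classifier only consumes quantities the model already produces while generating its first tokens, so detection adds no extra LLM queries. The synthetic accuracy says nothing about real attack data.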

Taxonomy

Methods: Attention Is All You Need · Absolute Position Encodings · Dense Connections · Attention Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection