SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu

TL;DR
SelfGrader is a novel, low-latency method for detecting jailbreak attacks on large language models by analyzing token-level logits to produce stable safety scores.
Contribution
It introduces a token-level logit-based scoring system with a dual-perspective rule, improving detection stability and reducing false positives.
Findings
Achieves up to a 22.66% reduction in attack success rate on LLaMA-3-8B.
Maintains significantly lower memory overhead and latency than baselines.
Demonstrates effectiveness across diverse benchmarks and models.
Abstract
Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from the randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using token-level logits. Specifically, SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0-9) and interprets their logit distribution as an internal safety signal. To align these signals with human intuition of maliciousness, SelfGrader introduces a dual-perspective scoring rule that considers both the maliciousness and benignness of the query, yielding…
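The abstract is cut off before it states the scoring rule, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes the guardrail model has already been prompted twice (once to grade maliciousness, once to grade benignness) and that the next-token logits for the ten numerical tokens 0-9 have been extracted for each prompt. The function names, the expected-grade reduction, the averaging rule, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def nt_distribution(nt_logits: Sequence[float]) -> List[float]:
    """Softmax restricted to the numerical-token (NT) logits only."""
    m = max(nt_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in nt_logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_grade(nt_logits: Sequence[float]) -> float:
    """Expected grade under the NT distribution; index i is the digit i."""
    probs = nt_distribution(nt_logits)
    return sum(i * p for i, p in enumerate(probs))


def dual_perspective_score(mal_logits: Sequence[float],
                           ben_logits: Sequence[float]) -> float:
    """Hypothetical combination rule (an assumption, not from the paper):
    average the normalized maliciousness grade with the inverted
    normalized benignness grade, giving a score in [0, 1]."""
    top = len(mal_logits) - 1  # highest grade, 9 for tokens 0-9
    mal = expected_grade(mal_logits) / top
    ben = expected_grade(ben_logits) / top
    return 0.5 * (mal + (1.0 - ben))


def is_jailbreak(mal_logits: Sequence[float],
                 ben_logits: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Flag the query when the combined score exceeds an assumed threshold."""
    return dual_perspective_score(mal_logits, ben_logits) > threshold
```

Because the score is computed from the logit distribution rather than from sampled text, the same query yields the same score on every call, which is the stability property the abstract emphasizes. In practice the threshold would need to be calibrated on held-out benign and malicious queries.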
