Gradient Co-occurrence Analysis for Detecting Unsafe Prompts in Large Language Models

Jingyuan Yang; Bowen Yan; Rongjun Li; Ziyu Zhou; Xin Chen; Zhiyong Feng; Wei Peng

arXiv:2502.12411·cs.CL·February 19, 2025

Open Access

TL;DR

GradCoo is a novel gradient co-occurrence analysis method that improves unsafe prompt detection in large language models by reducing directional bias and achieving state-of-the-art performance across multiple benchmarks and models.

Contribution

Introduces GradCoo, a new gradient analysis technique that expands safety-critical parameter detection to include unsigned similarity, overcoming directional bias in unsafe prompt detection.

Findings

01

Achieves state-of-the-art performance on the ToxicChat and XSTest datasets.

02

Demonstrates effective unsafe prompt detection across various LLM sizes and types.

03

Reduces directional bias, improving detection accuracy.

Abstract

Unsafe prompts pose significant safety risks to large language models (LLMs). Existing methods for detecting unsafe prompts rely on data-driven fine-tuning to train guardrail models, which requires significant data and computational resources. In contrast, recent few-shot gradient-based methods require only a few safe and unsafe reference prompts. Such an approach identifies unsafe prompts by analyzing consistent patterns in the gradients of safety-critical parameters in LLMs. Although effective, its restriction to directional similarity (cosine similarity) introduces "directional bias", limiting its capability to identify unsafe prompts. To overcome this limitation, we introduce GradCoo, a novel gradient co-occurrence analysis method that expands the scope of safety-critical parameter identification to include unsigned gradient similarity, thereby reducing the impact…
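
The contrast the abstract draws between directional (cosine) similarity and unsigned gradient similarity can be illustrated with a toy sketch. This is not the paper's GradCoo procedure, whose details are not given here; it only assumes that "unsigned" means comparing gradient magnitudes while ignoring sign. The vectors and the function names `cosine` and `unsigned_cosine` are hypothetical and chosen for illustration.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Signed (directional) cosine similarity between two gradient vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def unsigned_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of element-wise magnitudes (an assumed reading of
    'unsigned similarity', not the paper's definition)."""
    return cosine(np.abs(a), np.abs(b))


# Hypothetical gradients over four parameters.
reference = np.array([1.0, -2.0, 0.0, 3.0])        # from an unsafe reference prompt
same_direction = np.array([2.0, -4.0, 0.0, 6.0])   # same parameters, same signs
flipped = -same_direction                          # same parameters, opposite signs
unrelated = np.array([0.0, 0.0, 5.0, 0.0])         # a different parameter is active

for name, grad in [("same_direction", same_direction),
                   ("flipped", flipped),
                   ("unrelated", unrelated)]:
    print(f"{name}: signed={cosine(reference, grad):+.2f}, "
          f"unsigned={unsigned_cosine(reference, grad):+.2f}")
```

In this toy case the sign-flipped gradient activates exactly the same parameters as the reference, yet the signed score treats it as maximally dissimilar, while the unsigned score treats it as a match. The unrelated gradient scores zero under both measures, so ignoring sign does not by itself make every gradient look similar.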

Taxonomy

Topics: Topic Modeling