Gradient Co-occurrence Analysis for Detecting Unsafe Prompts in Large Language Models
Jingyuan Yang, Bowen Yan, Rongjun Li, Ziyu Zhou, Xin Chen, Zhiyong Feng, Wei Peng

TL;DR
GradCoo is a novel gradient co-occurrence analysis method that improves unsafe prompt detection in large language models by reducing directional bias and achieving state-of-the-art performance across multiple benchmarks and models.
Contribution
Introduces GradCoo, a new gradient analysis technique that expands safety-critical parameter detection to include unsigned similarity, overcoming directional bias in unsafe prompt detection.
Findings
Achieves state-of-the-art performance on the ToxicChat and XSTest datasets.
Demonstrates effective unsafe prompt detection across various LLM sizes and types.
Reduces directional bias, improving detection accuracy.
Abstract
Unsafe prompts pose significant safety risks to large language models (LLMs). Existing methods for detecting unsafe prompts rely on data-driven fine-tuning to train guardrail models, necessitating significant data and computational resources. In contrast, recent few-shot gradient-based methods have emerged that require only a few safe and unsafe reference prompts. A gradient-based approach identifies unsafe prompts by analyzing consistent patterns of the gradients of safety-critical parameters in LLMs. Although effective, its restriction to directional similarity (cosine similarity) introduces "directional bias", limiting its capability to identify unsafe prompts. To overcome this limitation, we introduce GradCoo, a novel gradient co-occurrence analysis method that expands the scope of safety-critical parameter identification to include unsigned gradient similarity, thereby reducing the impact…
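The abstract is truncated here and does not give GradCoo's exact formulation, so the following is only an illustrative sketch of the contrast it describes between directional (cosine) similarity and a sign-insensitive alternative. The function names, the toy gradient vectors, and the choice to compare element-wise magnitudes are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Signed (directional) cosine similarity between two flattened gradients."""
    a, b = np.ravel(a), np.ravel(b)
    # Small epsilon avoids division by zero for all-zero gradients.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def unsigned_similarity(a, b):
    """Hypothetical sign-insensitive variant: cosine similarity of magnitudes."""
    return cosine_similarity(np.abs(a), np.abs(b))

# Toy gradients (made up): the same parameters are active in both,
# but two of the three non-zero entries have opposite signs.
reference_grad = np.array([1.0, -2.0, 0.0, 3.0])
prompt_grad = np.array([1.0, 2.0, 0.0, -3.0])

signed = cosine_similarity(reference_grad, prompt_grad)
unsigned = unsigned_similarity(reference_grad, prompt_grad)

print(f"signed cosine similarity: {signed:.3f}")
print(f"unsigned similarity:      {unsigned:.3f}")
```

In this toy case the signed score is negative even though both gradients concentrate on the same parameters, while the magnitude-based score is close to 1. That is one way to picture how a purely directional measure could miss a prompt whose gradient pattern overlaps with the unsafe reference.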
Taxonomy
Topics: Topic Modeling
