TL;DR
GAVEL introduces a rule-based, interpretable activation monitoring framework for LLM safety, enabling real-time violation detection, domain customization, and improved precision without retraining.
Contribution
The paper models activations as cognitive elements (CEs) and defines predicate rules over them, enabling scalable, transparent, and customizable safety monitoring in large language models.
Findings
Improved precision over existing activation safety methods
Supports domain-specific customization without retraining
Provides an open-source tool for rule authoring and management
Abstract
Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, suffer from poor precision, limited flexibility, and a lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as 'making a threat' and 'payment processing', that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update…
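To make the idea of predicate rules over CEs concrete, here is a minimal Python sketch. It is not GAVEL's actual rule syntax or API, which this excerpt does not show. It assumes some upstream probe has already turned model activations into a per-CE score in [0, 1]; the helper names (`present`, `all_of`, `any_of`, `negate`, `Rule`, `violations`), the 0.5 threshold, the rule name, and the example scores are all hypothetical. Only the two CE labels come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: an upstream activation probe yields one score per cognitive
# element (CE). How GAVEL actually derives CEs is not shown in this excerpt.
Scores = Dict[str, float]
Predicate = Callable[[Scores], bool]


def present(ce: str, threshold: float = 0.5) -> Predicate:
    """True when the named CE scores at or above a (hypothetical) threshold."""
    return lambda scores: scores.get(ce, 0.0) >= threshold


def all_of(*preds: Predicate) -> Predicate:
    """Logical AND over predicates."""
    return lambda scores: all(p(scores) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR over predicates."""
    return lambda scores: any(p(scores) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Logical NOT of a predicate."""
    return lambda scores: not pred(scores)


@dataclass(frozen=True)
class Rule:
    name: str
    predicate: Predicate


def violations(rules: List[Rule], scores: Scores) -> List[str]:
    """Return the names of all rules whose predicate fires on these scores."""
    return [rule.name for rule in rules if rule.predicate(scores)]


# Hypothetical rule: a threat combined with payment handling.
RULES = [
    Rule(
        name="threat_with_payment",
        predicate=all_of(present("making a threat"), present("payment processing")),
    )
]

# Illustrative scores only, not measured data.
benign = {"making a threat": 0.05, "payment processing": 0.92}
flagged = {"making a threat": 0.81, "payment processing": 0.77}

print(violations(RULES, benign))
print(violations(RULES, flagged))
```

Because the rules are plain data composed from small predicates, a practitioner could add or change a rule without touching the underlying detector, which is the kind of retraining-free customization the summary describes.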
