GAVEL: Towards Rule-Based Safety Through Activation Monitoring

Shir Rozenfeld, Rahul Pankajakshan, Itay Zloczower, Eyal Lenga, Gilad Gressel, and Yisroel Mirsky

arXiv:2601.19768·cs.AI·May 1, 2026

TL;DR

GAVEL introduces a rule-based, interpretable activation monitoring framework for LLM safety, enabling real-time violation detection, domain customization, and improved precision without retraining.

Contribution

The paper proposes modeling activations as cognitive elements (CEs) and defines predicate rules over them, aiming for scalable, transparent, and customizable safety monitoring in large language models.

Findings

1. Improved precision over existing activation safety methods
2. Supports domain-specific customization without retraining
3. Provides an open-source tool for rule authoring and management

Abstract

Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as 'making a threat' and 'payment processing', that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update…
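
As a rough illustration of the idea in the abstract, the sketch below treats each cognitive element as a named score and a rule as a predicate over the set of CEs that are currently active. It is not the GAVEL implementation or its rule syntax. The `Rule` class, the `check` function, the 0.5 threshold, the rule name `threat_with_payment`, and the example scores are all assumptions made for this example; only the two CE names come from the abstract. How CE scores are obtained from model activations is out of scope here, so the scores are supplied by hand.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical representation: CE name -> score in [0, 1].
# In a real monitor these scores would be derived from model activations;
# here they are hard-coded for illustration.
CEScores = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """A named predicate over the set of active cognitive elements."""
    name: str
    predicate: Callable[[Set[str]], bool]


def active_ces(scores: CEScores, threshold: float = 0.5) -> Set[str]:
    """Return the names of CEs whose score meets the (assumed) threshold."""
    return {name for name, score in scores.items() if score >= threshold}


def check(rules: List[Rule], scores: CEScores, threshold: float = 0.5) -> List[str]:
    """Return the names of all rules whose predicate holds for these scores."""
    active = active_ces(scores, threshold)
    return [rule.name for rule in rules if rule.predicate(active)]


# Hypothetical rule: fires only when both CEs are active at the same time.
RULES = [
    Rule(
        name="threat_with_payment",
        predicate=lambda ces: {"making a threat", "payment processing"} <= ces,
    ),
]

# Made-up scores: an ordinary payment flow versus a threat tied to a payment.
benign = {"making a threat": 0.05, "payment processing": 0.91}
flagged = {"making a threat": 0.88, "payment processing": 0.79}

print(check(RULES, benign))
print(check(RULES, flagged))
```

Composing CEs this way is what lets a single broad signal such as "payment processing" stay benign on its own while the conjunction with "making a threat" triggers a violation. Because the rules are plain data, they can be added or edited without touching the model.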

Code & Models

Repositories

Offensive-AI-Lab/gavel (GitHub)

Videos

GAVEL: Towards Rule-Based Safety through Activation Monitoring (SlidesLive)