Detecting High-Stakes Interactions with Activation Probes

Alex McKenzie; Urja Pawar; Phil Blandfort; William Bankes; David Krueger; Ekdeep Singh Lubana; Dmitrii Krasheninnikov

arXiv:2506.10805·cs.LG·January 26, 2026

TL;DR

This paper explores activation probes for detecting high-stakes interactions in large language models, demonstrating their robustness, efficiency, and potential for hierarchical monitoring systems to improve safety and resource management.

Contribution

The paper evaluates several activation-probe architectures trained on synthetic data and shows that they generalize to diverse, out-of-distribution, real-world data while reducing monitoring compute by six orders of magnitude.

Findings

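01 Probes perform comparably to prompted or finetuned medium-sized LLM monitors.

02 Probes trained on synthetic data generalize to diverse, out-of-distribution, real-world data.

03 Reusing the monitored model's activations yields computational savings of six orders of magnitude.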

Abstract

Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting "high-stakes" interactions, where the text indicates that the interaction might lead to significant harm, as a critical yet underexplored target for such monitoring. We evaluate several probe architectures trained on synthetic data and find that they generalize robustly to diverse, out-of-distribution, real-world data. The probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. These savings are enabled by reusing activations of the model that is being monitored. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more…
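
The two ideas in the abstract, a cheap probe on reused activations and a hierarchical monitor that escalates only some cases, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the activations are random synthetic vectors rather than real LLM activations, the probe is a plain logistic regression (the listing does not say which architectures the paper uses), and the names `train_linear_probe`, `probe_score`, `cascade`, and `expensive_monitor`, along with the 0.2/0.8 thresholds, are hypothetical.

```python
import numpy as np


def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe on activation vectors by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the mean log-loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_score(acts, w, b):
    """Return the probe's estimated probability that each input is high-stakes."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))


def cascade(scores, expensive_monitor, low=0.2, high=0.8):
    """Keep confident probe decisions; escalate uncertain ones to a costlier monitor."""
    decisions = scores >= 0.5
    uncertain = (scores > low) & (scores < high)
    for i in np.flatnonzero(uncertain):
        decisions[i] = expensive_monitor(i)
    return decisions, uncertain


# Synthetic stand-in for activations: two classes shifted along one direction.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=400).astype(float)
acts = rng.normal(size=(400, d)) + np.outer(2 * labels - 1, direction)

# Train on the first 300 examples, score the remaining 100.
w, b = train_linear_probe(acts[:300], labels[:300])
scores = probe_score(acts[300:], w, b)

# Hypothetical expensive monitor: here an oracle that simply looks up the label.
decisions, escalated = cascade(scores, lambda i: bool(labels[300 + i]))
accuracy = (decisions == labels[300:].astype(bool)).mean()
print(f"accuracy={accuracy:.2f}, escalated={int(escalated.sum())} of {len(scores)}")
```

In a real deployment the oracle would be replaced by a prompted or finetuned LLM monitor, and the escalation band would be tuned to the available compute budget; the sketch only shows the control flow.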

Code & Models

Datasets

Arrrlex/models-under-pressure
dataset · 113 dl

Videos

Detecting High-Stakes Interactions with Activation Probes · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)