Red-teaming Activation Probes using Prompted LLMs

Phil Blandfort; Robert Graham

arXiv:2511.00554 · cs.LG · November 4, 2025

Open Access

TL;DR

This paper introduces a lightweight black-box red-teaming method that wraps an off-the-shelf LLM with iterative feedback and in-context learning to surface failure modes in activation probes, revealing interpretable brittleness patterns and persistent vulnerabilities.

Contribution

It presents a minimal-effort approach for red-teaming activation probes that requires no fine-tuning, gradients, or architectural access.

Findings

01 Identified interpretable brittleness patterns, e.g., false positives induced by legalese and false negatives induced by a bland procedural tone

02 Detected reduced but persistent vulnerabilities under scenario-constraint attacks

03 Provided insights for improving probe robustness

Abstract

Activation probes are attractive monitors for AI systems due to low cost and latency, but their real-world robustness remains underexplored. We ask: What failure modes arise under realistic, black-box adversarial pressure, and how can we surface them with minimal effort? We present a lightweight black-box red-teaming procedure that wraps an off-the-shelf LLM with iterative feedback and in-context learning (ICL), and requires no fine-tuning, gradients, or architectural access. Running a case study with probes for high-stakes interactions, we show that our approach can help discover valuable insights about a SOTA probe. Our analysis uncovers interpretable brittleness patterns (e.g., legalese-induced FPs; bland procedural tone FNs) and reduced but persistent vulnerabilities under scenario-constraint attacks. These results suggest that simple prompted red-teaming scaffolding can anticipate…
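The loop described in the abstract (an off-the-shelf LLM prompted with scored past attempts as in-context examples) might look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_probe`, `make_stub_llm`, `build_prompt`, and `red_team` are hypothetical names, the keyword scorer merely stands in for a real activation probe, and the stub stands in for a real LLM call.

```python
import random
from typing import Callable, List, Tuple

Attempt = Tuple[str, float]  # (candidate text, probe score in [0, 1])


def toy_probe(text: str) -> float:
    """Hypothetical stand-in for an activation probe (assumption):
    scores text as 'high-stakes' by its share of legalese keywords."""
    keywords = {"hereby", "pursuant", "liability"}
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(word in keywords for word in words)
    return min(1.0, 2 * hits / len(words))


def make_stub_llm(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for an off-the-shelf LLM (assumption):
    copies the first in-context example and appends one random word."""
    rng = random.Random(seed)
    vocab = ["hereby", "pursuant", "liability", "lunch", "menu", "please"]

    def llm(prompt: str) -> str:
        example_lines = [l for l in prompt.splitlines() if l.startswith("- ")]
        best_text = example_lines[0][2:].rsplit(" => ", 1)[0]
        return f"{best_text} {rng.choice(vocab)}"

    return llm


def build_prompt(goal: str, examples: List[Attempt]) -> str:
    """Format the highest-scoring past attempts as in-context examples."""
    lines = [f"Goal: {goal}", "Previous attempts (text => probe score):"]
    lines += [f"- {text} => {score:.2f}" for text, score in examples]
    lines.append("Write one new attempt that scores higher.")
    return "\n".join(lines)


def red_team(
    llm: Callable[[str], str],
    probe: Callable[[str], float],
    goal: str,
    seed_text: str,
    rounds: int = 20,
    k: int = 3,
) -> List[Attempt]:
    """Black-box loop: only text goes in and a score comes out;
    no gradients, fine-tuning, or access to model internals."""
    history: List[Attempt] = [(seed_text, probe(seed_text))]
    for _ in range(rounds):
        top = sorted(history, key=lambda a: a[1], reverse=True)[:k]
        candidate = llm(build_prompt(goal, top))
        history.append((candidate, probe(candidate)))
    return sorted(history, key=lambda a: a[1], reverse=True)


if __name__ == "__main__":
    results = red_team(
        make_stub_llm(),
        toy_probe,
        goal="Low-stakes message that the probe flags as high-stakes",
        seed_text="Please confirm the lunch menu",
    )
    for text, score in results[:3]:
        print(f"{score:.2f}  {text}")
```

In this sketch the feedback signal is just the probe score attached to each past attempt. Swapping the stub for a real LLM client and the keyword scorer for a real probe would leave the loop itself unchanged, since both are passed in as plain callables.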

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques