Constitutional Black-Box Monitoring for Scheming in LLM Agents
Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

TL;DR
This paper explores the use of constitutional black-box language model monitors to detect scheming in LLM agents, demonstrating that synthetic data-based training can generalize to realistic environments but faces limitations due to overfitting.
Contribution
It introduces two synthetic data generation pipelines and evaluates prompt optimization strategies for LLM-based monitors in detecting agent scheming.
Findings
Monitors trained on synthetic data can generalize to realistic environments.
Prompt sweeps perform as well as extensive prompt optimization.
Performance saturates quickly, indicating overfitting risks.
Abstract
Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
