DeceptGuard: A Constitutional Oversight Framework for Detecting Deception in LLM Agents
Snehasis Mukhopadhyay

TL;DR
This paper introduces DECEPTGUARD, a framework for detecting deception in Large Language Model (LLM) agents. It compares black-box, CoT-aware, and activation-probe monitoring regimes and uses a synthetic trajectory pipeline (DECEPTSYNTH), significantly improving detection accuracy, especially for subtle deception.
Contribution
The paper presents DECEPTGUARD, a unified framework that combines multiple monitoring strategies and a synthetic data pipeline to enhance deception detection in LLM agents.
Findings
CoT-aware and activation-probe monitors outperform black-box monitors.
Hybrid ensembles achieve high detection performance, with a pAUROC of 0.934.
Detection effectiveness decreases as agents suppress overt behavioral signals.
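The ensemble and pAUROC findings above can be illustrated with a minimal sketch. It assumes that pAUROC denotes partial AUROC (area under the ROC curve up to a false-positive-rate cap, normalized by that cap) and that a hybrid ensemble averages per-monitor suspicion scores. The averaging rule, the cap of 0.2, the function names, and the toy data are all assumptions made for illustration; the summary does not state how the paper combines monitors or which FPR range it uses.

```python
from typing import Dict, List, Sequence, Tuple


def ensemble_score(monitor_scores: Dict[str, float]) -> float:
    """Unweighted mean of per-monitor suspicion scores (assumed combination rule)."""
    return sum(monitor_scores.values()) / len(monitor_scores)


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        # Items with tied scores cross the threshold together.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def partial_auroc(labels: Sequence[int], scores: Sequence[float], max_fpr: float = 0.2) -> float:
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the final segment at the FPR cap by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr


# Toy trajectories (hypothetical): label 1 = deceptive, 0 = benign.
labels = [1, 1, 0, 0]
per_monitor = [
    {"black_box": 0.6, "cot_aware": 0.9, "activation_probe": 0.9},
    {"black_box": 0.3, "cot_aware": 0.8, "activation_probe": 0.7},
    {"black_box": 0.4, "cot_aware": 0.2, "activation_probe": 0.3},
    {"black_box": 0.1, "cot_aware": 0.1, "activation_probe": 0.1},
]
hybrid = [ensemble_score(m) for m in per_monitor]
print(partial_auroc(labels, hybrid, max_fpr=0.2))
```

In this toy data the black-box scores alone misrank the second and third trajectories, while the averaged scores separate the two classes. Restricting the area to low false-positive rates reflects the operating region a deployed monitor would typically care about; setting `max_fpr=1.0` recovers the ordinary AUROC.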
Abstract
Reliable detection of deceptive behavior in Large Language Model (LLM) agents is an essential prerequisite for safe deployment in high-stakes agentic contexts. Prior work on scheming detection has focused exclusively on black-box monitors that observe only externally visible tool calls and outputs, discarding potentially rich internal reasoning signals. We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace), and activation-probe monitors (additionally reading hidden-state representations from a frozen open-weights encoder). We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception-positive and deception-negative agent trajectories across a novel 12-category taxonomy spanning verbal,…
Taxonomy
Topics: Deception detection and forensic psychology · Topic Modeling · Explainable Artificial Intelligence (XAI)
