Building Better Deception Probes Using Targeted Instruction Pairs
Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

TL;DR
This paper improves linear probes for detecting deception in AI systems by choosing the training instruction pair carefully and by targeting specific behaviors through a human-interpretable deception taxonomy, which yields more accurate, behavior-specific detection.
Contribution
The paper demonstrates that carefully designed instruction pairs and a deception taxonomy improve probe performance, and it highlights the need for organization-specific probes rather than a universal detector.
Findings
Instruction pairs capture deceptive intent rather than content.
Prompt choice explains 70.6% of performance variance.
Targeted probes outperform generic ones in deception detection.
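A figure such as "prompt choice explains 70.6% of performance variance" is a variance-decomposition statement. The summary does not say which method the authors used; one common way to compute a quantity of this kind is eta-squared from a one-way ANOVA, the share of total variance that lies between group means. The sketch below assumes that approach. The function name `eta_squared` and the toy score grid are hypothetical and are not taken from the paper.

```python
import numpy as np

def eta_squared(scores):
    """Share of total variance explained by the row grouping.

    scores: 2D array where each row is one group (e.g. one instruction
    pair) and each column is one observation (e.g. one evaluation dataset).
    Assumes every group has the same number of observations.
    """
    scores = np.asarray(scores, dtype=float)
    grand_mean = scores.mean()
    group_means = scores.mean(axis=1)
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = scores.shape[1] * np.sum((group_means - grand_mean) ** 2)
    # Total sum of squares: spread of every observation around the grand mean.
    ss_total = np.sum((scores - grand_mean) ** 2)
    return ss_between / ss_total

# Made-up illustrative scores (NOT results from the paper):
# two hypothetical prompts, each evaluated on two hypothetical datasets.
toy_scores = np.array([[0.9, 0.8],
                       [0.6, 0.5]])
print(round(eta_squared(toy_scores), 3))
```

A value near 1 means most of the spread in probe performance comes from which prompt was used, and a value near 0 means the prompt hardly matters compared with other sources of variation.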
Abstract
Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we…
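To make the setup in the abstract concrete, here is a minimal sketch of a linear probe trained on contrastive data. It is not the authors' pipeline. It assumes that activations collected under an "honest" instruction and a "deceptive" instruction differ along some direction, and it replaces real model activations with synthetic Gaussian vectors. All names (`make_synthetic_activations`, `train_linear_probe`, `probe_scores`) and all hyperparameters are hypothetical.

```python
import numpy as np

def make_synthetic_activations(n_per_class=200, dim=32, shift=3.0, seed=0):
    """Stand-in for model activations (an assumption, not real data).

    Honest examples are standard Gaussian noise; deceptive examples are
    the same noise shifted along one hidden unit direction.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    honest = rng.normal(size=(n_per_class, dim))
    deceptive = rng.normal(size=(n_per_class, dim)) + shift * direction
    X = np.vstack([honest, deceptive])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-2):
    """Logistic regression fitted by plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Gradient of the mean log loss, plus L2 regularization on w.
        w -= lr * (X.T @ (probs - y) / n + l2 * w)
        b -= lr * np.mean(probs - y)
    return w, b

def probe_scores(X, w, b):
    """Higher score means the probe reads the activation as more deceptive."""
    return X @ w + b

X, y = make_synthetic_activations()
w, b = train_linear_probe(X, y)
accuracy = np.mean((probe_scores(X, w, b) > 0) == (y == 1))
print(f"training accuracy: {accuracy:.2f}")
```

In a real setting the rows of `X` would be activations extracted from a chosen model layer while the same content is processed under each instruction of the pair, so changing the instruction pair changes the training data and therefore the direction the probe learns.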
Taxonomy
Topics: Deception detection and forensic psychology · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques
