Building Better Deception Probes Using Targeted Instruction Pairs

Vikram Natarajan; Devina Jain; Shivam Arora; Satvik Golechha; Joseph Bloom

arXiv:2602.01425·cs.AI·February 3, 2026

TL;DR

This paper improves linear probes for detecting deception in AI systems. It shows that the instruction pair used during training matters, and that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception yields more accurate, behavior-specific detection.

Contribution

The paper demonstrates that carefully designed instruction pairs and a deception taxonomy improve probe performance, and it highlights the need for organization-specific probes rather than a universal detector.

Findings

01 Instruction pairs capture deceptive intent rather than content.

02 Prompt choice explains 70.6% of performance variance.

03 Targeted probes outperform generic ones in deception detection.
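
This summary does not say how the 70.6% variance figure was computed. As a rough illustration only, one common way to express "factor X explains a share of the variance" is eta-squared from a one-way decomposition: between-group sum of squares divided by total sum of squares. The sketch below assumes that approach. The function name `eta_squared`, the prompt labels, and the scores are all hypothetical toy values, not data from the paper.

```python
def eta_squared(groups):
    """Share of total variance explained by group membership.

    `groups` maps a group label (here: a hypothetical prompt name)
    to a list of scores obtained under that group.
    """
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(all_values) / len(all_values)

    # Total sum of squares: spread of every score around the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    # Between-group sum of squares: spread of group means around the
    # grand mean, weighted by group size.
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return ss_between / ss_total


# Toy, made-up probe scores grouped by the (hypothetical) training prompt.
toy_scores = {
    "prompt_a": [0.90, 0.80, 0.85],
    "prompt_b": [0.60, 0.55, 0.65],
    "prompt_c": [0.75, 0.70, 0.80],
}
share = eta_squared(toy_scores)
print(f"share of variance attributed to prompt choice: {share:.3f}")
```

A value near 1 means scores differ mostly between prompts, and a value near 0 means the prompt barely matters relative to other sources of variation.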

Abstract

Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we…
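
To make the idea of a linear probe trained on contrastive data concrete, here is a minimal sketch under strong simplifying assumptions. It does not use a real model: the "activations" are synthetic Gaussian vectors standing in for hidden states collected under an honest versus a deceptive instruction. It also swaps in a difference-of-means probe for simplicity, which is not necessarily the classifier used in the paper. All function names (`fit_mean_difference_probe`, `probe_score`, `synthetic_activations`) are hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_mean_difference_probe(honest_acts, deceptive_acts):
    """Fit a linear probe: direction = mean(deceptive) - mean(honest).

    The bias places the decision boundary at the midpoint of the two means.
    """
    mu_h = mean_vector(honest_acts)
    mu_d = mean_vector(deceptive_acts)
    direction = [d - h for d, h in zip(mu_d, mu_h)]
    midpoint = [(d + h) / 2 for d, h in zip(mu_d, mu_h)]
    bias = -dot(direction, midpoint)
    return direction, bias


def probe_score(direction, bias, activation):
    """Positive score means 'deceptive', negative means 'honest'."""
    return dot(direction, activation) + bias


def synthetic_activations(rng, n, dim, shift):
    """Stand-in for model activations: unit Gaussian noise, with the
    first coordinate shifted to mimic an instruction-dependent signal."""
    return [
        [rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)
# Assumption: one batch per side of the contrastive instruction pair.
honest_train = synthetic_activations(rng, 200, 8, -2.0)
deceptive_train = synthetic_activations(rng, 200, 8, 2.0)
direction, bias = fit_mean_difference_probe(honest_train, deceptive_train)

# Evaluate on fresh synthetic samples.
honest_test = synthetic_activations(rng, 100, 8, -2.0)
deceptive_test = synthetic_activations(rng, 100, 8, 2.0)
correct = sum(probe_score(direction, bias, x) < 0 for x in honest_test)
correct += sum(probe_score(direction, bias, x) > 0 for x in deceptive_test)
accuracy = correct / (len(honest_test) + len(deceptive_test))
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In a real setting the two training batches would come from the same dataset run under each instruction of the pair, so the probe direction reflects the instruction contrast rather than the content of the examples.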

Taxonomy

Topics: Deception detection and forensic psychology · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques