Benchmarking Deception Probes via Black-to-White Performance Boosts
Avi Parrack, Carlo Leonardo Attubato, Stefan Heimersheim

TL;DR
This paper evaluates deception probes in language models by comparing white-box and black-box monitoring, finding that white-box access to probe activations yields a weak but encouraging improvement in deception detection.
Contribution
It introduces a benchmarking approach using black-to-white performance boosts to assess deception probe effectiveness and resistance to evasion strategies.
Findings
White-box monitoring outperforms black-box monitoring in deception detection.
Existing deception probes show weak but promising black-to-white performance boosts.
The study highlights the importance of access level in deception detection effectiveness.
Abstract
AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called "deception probes") have been trained to distinguish the internal activations of a language model during deceptive versus honest responses. However, it is unclear how effective these probes are at detecting deception in practice, and whether they are resistant to simple counter-strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white-box monitor outperforms the black-box monitor, i.e. the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.
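The comparison described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: the boost is measured as a difference in AUROC, the white-box monitor is stood in for by a plain mean over token-level probe scores, and all names and numbers are hypothetical toy values.

```python
from statistics import mean


def auroc(scores, labels):
    """AUROC as P(deceptive score > honest score); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one deceptive and one honest example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def black_to_white_boost(black_scores, white_scores, labels):
    """Assumed definition: white-box AUROC minus black-box AUROC."""
    return auroc(white_scores, labels) - auroc(black_scores, labels)


# Toy data (hypothetical): 1 = deceptive response, 0 = honest response.
labels = [1, 1, 0, 0]

# Suspicion scores from a black-box monitor that sees only the transcript.
black_scores = [0.9, 0.4, 0.5, 0.1]

# Token-level probe activations per response; the stand-in white-box
# monitor simply averages them into one score per response.
token_probe_scores = [[0.8, 0.6], [0.7, 0.5], [0.2, 0.4], [0.1, 0.3]]
white_scores = [mean(tokens) for tokens in token_probe_scores]

boost = black_to_white_boost(black_scores, white_scores, labels)
print(f"black-box AUROC: {auroc(black_scores, labels):.2f}")
print(f"white-box AUROC: {auroc(white_scores, labels):.2f}")
print(f"black-to-white boost: {boost:.2f}")
```

A positive boost means the probe signal added information beyond what the black-box monitor already extracted; a boost near zero means the probe was redundant for that monitor on that dataset.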
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The experimental setup is thorough and well-designed. On top of many datasets tested, the inclusion of an adversarial attack ("Deception Dilution") and a control dataset (Alpaca) further strengthens the evaluation.
- The paper produces several clear and actionable insights. The finding that a simple "Full Averaging" of probe activations (line 373) is an important result which suggests that complex reasoning over probe scores may not always be necessary or even beneficial. The paper is also t
1. Regarding results for hybrid reasoning models: To me, the most concerning result is the performance of the White-box Reasoning (WBR) monitor on the Backdoors dataset. As shown in Figure 2, the WBR method performs significantly worse than the Black-box Reasoning (BBR) baseline. This is counterintuitive; adding a genuinely informative signal (the Pairs probe has an AUROC of 0.844 on this dataset with Full Averaging) should not make an LLM-based monitor worse. The paper's explanation (lines 417-4
- The proposed core evaluation metric, the "Black-to-White Performance Boost", is a pragmatic and significant contribution. It shifts the evaluative focus from the "absolute accuracy of probes" to the more critical question of "what marginal benefit probes can offer in realistic monitoring scenarios."
- The inclusion of stress tests for out-of-distribution (OOD) generalization (the `Backdoors` dataset) and adversarial robustness (the `Deception Dilution` attack) is crucial for evaluating the pract
While the paper's premise is strong, its core methodology and experimental design suffer from flaws that undermine the persuasiveness of its conclusions.
- **Fundamental Flaw in Experimental Design:** The authors employ Claude 3.5 Sonnet as the reasoning engine for both BBR and WBR, to monitor a comparatively weaker LLaMA-3.3-70B assistant. In this asymmetric "strong-monitors-weak" setting, the BBR baseline (relying on Claude 3.5's black-box scrutiny alone) achieves exceptionally high performan
* The evaluation of linear probes for deception detection is an interesting, relevant, and well-motivated problem.
* Evaluation on 4 reasonable benchmark datasets, the details of which are described clearly.
* Robustness against countermeasures is a very interesting direction.
* Figure is plotted with 95% bootstrap confidence intervals.
Overall I think the research question tackled in the paper is promising.
* The clarity of the paper could be improved, especially the discussion of the methodology would benefit from a more formal and precise formulation. In its current state it was hard to follow and understand clearly how the monitoring methods are implemented.
* While this is an interesting direction, I think the current evaluation is fairly limited by considering only a specific set of linear probes, and comparing to only a single model for black-box reasoning. I would have been very interested t
Strengths: The paper is exceptionally well written and easy to follow; it was a pleasure to read. It tackles an important and timely question with a simple, effective, and empirically grounded strategy: does advanced probe-based white-box monitoring on internal activations provide tangible benefits over more accessible black-box monitoring approaches? The authors present a detailed and systematic set of experiments that address this question convincingly—the results clearly show that while white
The selection of sonnet/haiku seems strange to me. I would like to see results with llama3.1 itself, then it would be a pure black-to-white comparison. I appreciate that the authors put a limitation in their paper, but I don't understand why it was chosen in the first place. There doesn't seem to be a reason to select stronger blackbox monitors than the whitebox counterparts. It would also be good to include other open-weight models to know if these kinds of results generalize. I am aware that
Taxonomy
Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Network Security and Intrusion Detection
