The Impact of Off-Policy Training Data on Probe Generalisation
Nathalie Kirch, Samuel Dower, Adrians Skapars, Helen Yannakoudakis, Ekdeep Singh Lubana, Dmitrii Krasheninnikov

TL;DR
This paper evaluates how off-policy training data affects the ability of probes to generalise across different large language model behaviours, highlighting challenges and proposing a predictive test for generalisation failures.
Contribution
It systematically studies the impact of off-policy data on probe generalisation and introduces a test to predict when probes may fail to generalise to on-policy scenarios.
Findings
The strategy used to generate training data can significantly affect probe performance, though the size of the effect varies greatly by behaviour.
Probes struggle most with behaviours defined by response 'intent'.
Successful generalisation to incentivised data correlates with on-policy performance.
Abstract
Probing has emerged as a promising method for monitoring large language models (LLMs), enabling cheap inference-time detection of concerning behaviours. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that training data generation strategy can significantly affect probe performance, though the magnitude varies greatly by behaviour. The largest generalisation failures arise for behaviours defined by response "intent" (e.g., strategic deception) rather than text-level content (e.g., usage of lists). We then propose a useful test for predicting generalisation failures in cases where on-policy test data is…
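To make the off-policy versus on-policy evaluation concrete, here is a minimal sketch of the kind of comparison the abstract describes. It is not the authors' code or data. The "activations" are synthetic Gaussian vectors, and the assumption that on-policy examples encode the behaviour along a direction rotated 60 degrees from the off-policy one is purely illustrative. All names (`make_activations`, `train_linear_probe`, `auroc`, `OFF_POLICY_DIR`, `ON_POLICY_DIR`) are hypothetical.

```python
import math
import random

DIM = 8  # hypothetical activation width, chosen only to keep the toy fast

def make_activations(n, direction, rng):
    """Synthetic stand-in for activations: +/- direction plus unit Gaussian noise."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 2 * y - 1
        xs.append([sign * d + rng.gauss(0.0, 1.0) for d in direction])
        ys.append(y)
    return xs, ys

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_probe(xs, ys, lr=0.5, steps=150):
    """Logistic-regression probe fitted with full-batch gradient descent."""
    w, b, n = [0.0] * DIM, 0.0, len(xs)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(score(w, b, x)) - y
            grad_b += err
            for j in range(DIM):
                grad_w[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Assumption: the behaviour direction differs between the two data sources.
OFF_POLICY_DIR = [1.5] + [0.0] * (DIM - 1)
ON_POLICY_DIR = [0.75, 1.3] + [0.0] * (DIM - 2)  # roughly a 60-degree rotation

rng = random.Random(0)
train_x, train_y = make_activations(500, OFF_POLICY_DIR, rng)
off_x, off_y = make_activations(500, OFF_POLICY_DIR, rng)  # held-out, same source
on_x, on_y = make_activations(500, ON_POLICY_DIR, rng)     # shifted source

w, b = train_linear_probe(train_x, train_y)
auc_off = auroc([score(w, b, x) for x in off_x], off_y)
auc_on = auroc([score(w, b, x) for x in on_x], on_y)
print(f"held-out off-policy AUROC: {auc_off:.3f}")
print(f"on-policy AUROC:           {auc_on:.3f}")
```

In this toy setup the probe scores well on held-out off-policy data but loses accuracy on the shifted on-policy set, because it has learned a direction that only partly overlaps with the one the on-policy data uses. How large such a gap is for real models and behaviours is an empirical question that this sketch does not address.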
