Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
Erik Nordby, Tasha Pais, Aviel Parrack

TL;DR
Combining linear probes from multiple layers into an ensemble improves detection of model deception, including on tasks where single-layer probes fail, and probe accuracy increases with model size across 12 models (0.5B to 176B parameters).
Contribution
The paper introduces a multi-layer probe ensemble that overcomes the fragility of single-layer probes, and shows that deception directions rotate gradually across layers, which accounts for that fragility.
Findings
Ensembling probes from multiple layers recovers strong performance where single-layer probes fail.
Probe accuracy improves by approximately 5% AUROC per 10x increase in model parameters.
Deception directions rotate gradually across layers rather than appearing at one location, which explains why single-layer probes are brittle and why multi-layer ensembles succeed.
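The scaling finding can be read as a log-linear trend, AUROC ≈ intercept + slope × log10(parameters), with a slope of about 0.05. The sketch below fits such a line by ordinary least squares. It assumes that "5% AUROC" means 0.05 absolute AUROC points, which the summary does not state explicitly, and the data points are made up for illustration; they are not the paper's measurements. The function name `fit_log_linear` is hypothetical.

```python
import math

def fit_log_linear(params, aurocs):
    """Least-squares fit of auroc = intercept + slope * log10(params).

    Returns (slope, intercept, pearson_r).
    """
    xs = [math.log10(p) for p in params]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(aurocs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in aurocs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, aurocs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Made-up points lying exactly on a line with slope 0.05 per decade.
# These are NOT the paper's data; they only exercise the fit.
params = [0.5e9, 1e9, 1e10, 1e11]
aurocs = [0.70 + 0.05 * (math.log10(p) - math.log10(0.5e9)) for p in params]

slope, intercept, r = fit_log_linear(params, aurocs)
print(f"slope per 10x parameters: {slope:.3f}, r = {r:.2f}")

# Gain implied by that slope across the paper's size range (0.5B to 176B).
decades = math.log10(176e9 / 0.5e9)
print(f"decades spanned: {decades:.2f}, implied gain: {slope * decades:.3f}")
```

Under this reading, the 0.5B to 176B range spans roughly 2.5 decades, so the trend would correspond to a gain of roughly 0.13 AUROC from the smallest to the largest model.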
Abstract
Linear probes can detect when language models produce outputs they "know" are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge. Across 12 models (0.5B–176B parameters), we find probe accuracy improves with scale: ~5% AUROC per 10x parameters (R=0.81). Geometrically, deception directions rotate gradually across layers rather than appearing at one location, explaining both why single-layer probes are brittle and why multi-layer ensembles succeed.
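The mechanics described in the abstract can be illustrated on a toy problem. The abstract does not say how probes are trained or how the ensemble combines them, so everything here is an assumption: the activations are synthetic, the probe is a simple difference-of-means direction, and the ensemble is a plain average of per-layer scores. All names and constants (`SIGNAL`, `fit_probe`, and so on) are hypothetical. The toy direction rotates by 90 degrees from the first layer to the last, so a probe fitted at one layer transfers poorly to a distant one.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, DIM, N_TRAIN, N_TEST = 6, 16, 2000, 2000
SIGNAL = 0.5  # hypothetical per-layer signal strength

# Toy "deception direction" per layer, rotating 90 degrees in total.
angles = np.linspace(0.0, np.pi / 2, N_LAYERS)
true_dirs = np.zeros((N_LAYERS, DIM))
true_dirs[:, 0] = np.cos(angles)
true_dirs[:, 1] = np.sin(angles)

def sample(n):
    """Return labels (n,) and synthetic activations (N_LAYERS, n, DIM)."""
    labels = rng.integers(0, 2, size=n)
    sign = (2 * labels - 1)[None, :, None]  # -1 honest, +1 deceptive
    noise = rng.normal(size=(N_LAYERS, n, DIM))  # independent per layer
    return labels, noise + SIGNAL * sign * true_dirs[:, None, :]

def fit_probe(x, y):
    """Difference-of-means linear probe: unit direction w and bias b."""
    mu1, mu0 = x[y == 1].mean(axis=0), x[y == 0].mean(axis=0)
    w = mu1 - mu0
    w /= np.linalg.norm(w)
    b = -w @ (mu1 + mu0) / 2
    return w, b

def auroc(scores, labels):
    """AUROC via the rank-sum formula (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos = pos.sum()
    n_neg = len(labels) - n_pos
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y_train, a_train = sample(N_TRAIN)
y_test, a_test = sample(N_TEST)
probes = [fit_probe(a_train[l], y_train) for l in range(N_LAYERS)]

# Per-layer probe scores on held-out data, then a plain average as the ensemble.
layer_scores = np.stack([a_test[l] @ w + b for l, (w, b) in enumerate(probes)])
layer_aurocs = [auroc(s, y_test) for s in layer_scores]
ensemble_auroc = auroc(layer_scores.mean(axis=0), y_test)

# Rotation check: cosine similarity between learned probe directions.
W = np.stack([w for w, _ in probes])
cos_adjacent = float(np.mean(np.sum(W[:-1] * W[1:], axis=1)))
cos_first_last = float(W[0] @ W[-1])

# Brittleness check: apply the first layer's probe to the last layer.
w0, b0 = probes[0]
transfer_auroc = auroc(a_test[-1] @ w0 + b0, y_test)

print("single-layer AUROCs:", np.round(layer_aurocs, 3))
print("ensemble AUROC:", round(float(ensemble_auroc), 3))
print("cosine adjacent vs first-last:", round(cos_adjacent, 3), round(cos_first_last, 3))
print("layer-0 probe on last layer AUROC:", round(float(transfer_auroc), 3))
```

In this toy setup the ensemble gains come from averaging independent per-layer noise, which is only one possible mechanism. On real activations, neighbouring layers are correlated, and a learned combiner (for example, a logistic regression over per-layer scores) may behave differently from a plain average.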
