Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling

Erik Nordby; Tasha Pais; Aviel Parrack

arXiv:2604.13386·cs.LG·April 16, 2026

TL;DR

Ensembling linear probes from multiple layers improves detection of model deception where single-layer probes fail, and probe accuracy rises with model size: about 5% AUROC per 10x parameters across 12 models from 0.5B to 176B parameters.

Contribution

It introduces a multi-layer ensembling approach that overcomes the fragility of single-layer probes, and it shows that deception directions rotate gradually across layers, which explains both that fragility and why ensembles succeed.
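One way to make the idea of a rotating direction concrete is to compare per-layer probe weight vectors by cosine similarity. The sketch below is illustrative only: the directions are synthetic (a vector turning by a fixed angle per layer inside a 2D plane), and `cosine_matrix`, the layer count, and the dimensions are hypothetical choices, not values or code from the paper.

```python
import numpy as np

def cosine_matrix(directions):
    """Pairwise cosine similarity between per-layer probe directions.

    directions: array of shape (n_layers, d_model), one weight vector per layer.
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T

# Synthetic stand-in (assumption): a direction that turns by a fixed angle
# per layer inside a 2D plane of a 16-dimensional space.
n_layers, d_model, step = 8, 16, np.pi / 14
angles = step * np.arange(n_layers)
directions = np.zeros((n_layers, d_model))
directions[:, 0] = np.cos(angles)
directions[:, 1] = np.sin(angles)

sims = cosine_matrix(directions)
adjacent = np.diag(sims, k=1)   # similarity between layer i and layer i+1
first_vs_last = sims[0, -1]     # similarity between the first and last layer
```

In this toy setup neighbouring layers stay closely aligned while the first and last layers end up orthogonal, which is the pattern under which a probe trained at one layer would transfer poorly to a distant one.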

Findings

01

Ensembling probes from multiple layers recovers strong performance where single-layer probes fail.

02

Probe accuracy improves by approximately 5% AUROC per 10x increase in model parameters (R=0.81 across 12 models, 0.5B to 176B parameters).

03

Deception directions rotate gradually across layers, affecting probe fragility and ensemble success.
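The scaling finding amounts to a trend that is linear in the logarithm of parameter count. A minimal sketch of that arithmetic follows, assuming the 5% figure means 0.05 absolute AUROC per factor of 10 in parameters (the summary does not say whether the gain is absolute or relative); `expected_auroc_gain` is a hypothetical helper name.

```python
import math

def expected_auroc_gain(params_from, params_to, slope_per_decade=0.05):
    """AUROC change implied by a log-linear trend between two model sizes.

    slope_per_decade is the assumed absolute AUROC gain per 10x parameters.
    """
    return slope_per_decade * math.log10(params_to / params_from)

gain_one_decade = expected_auroc_gain(1e9, 1e10)       # one factor of 10
gain_full_range = expected_auroc_gain(0.5e9, 176e9)    # smallest to largest model size studied
```

Under this reading, the span from 0.5B to 176B parameters covers about 2.5 decades, so the trend alone would imply a gain of roughly 0.13 AUROC. This is an extrapolation of the reported slope, not a result stated in the paper.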

Abstract

Linear probes can detect when language models produce outputs they "know" are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge. Across 12 models (0.5B to 176B parameters), we find probe accuracy improves with scale: ~5% AUROC per 10x parameters (R=0.81). Geometrically, deception directions rotate gradually across layers rather than appearing at one location, explaining both why single-layer probes are brittle and why multi-layer ensembles succeed.
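The ensembling idea can be sketched on synthetic data. Everything here is an assumption made for illustration: the abstract does not specify the probe type or how layers are combined, so this sketch swaps in difference-of-means linear probes and averages their z-scored outputs; the activations are random with a planted per-layer direction, and all function names are hypothetical.

```python
import numpy as np

def fit_mean_diff_probe(x, y):
    """Difference-of-means linear probe: returns (weights, bias)."""
    mu_pos, mu_neg = x[y == 1].mean(axis=0), x[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = -(w @ (mu_pos + mu_neg)) / 2.0   # decision boundary at the class midpoint
    return w, b

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney) statistic; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ensemble_scores(acts_train, y_train, acts_test):
    """Fit one probe per layer, z-score each probe's test scores, then average.

    acts_*: arrays of shape (n_layers, n_samples, d_model).
    """
    per_layer = []
    for x_tr, x_te in zip(acts_train, acts_test):
        w, b = fit_mean_diff_probe(x_tr, y_train)
        s_tr, s_te = x_tr @ w + b, x_te @ w + b
        per_layer.append((s_te - s_tr.mean()) / s_tr.std())  # scale with train stats
    per_layer = np.stack(per_layer)
    return per_layer, per_layer.mean(axis=0)

# Synthetic activations (assumption): each layer carries the label along its
# own random unit direction, plus unit-variance Gaussian noise.
rng = np.random.default_rng(0)
n_layers, n, d = 4, 400, 8
y = np.tile([0, 1], n // 2)                       # balanced binary labels
dirs = rng.normal(size=(n_layers, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
acts = (2 * y - 1)[None, :, None] * dirs[:, None, :] + rng.normal(size=(n_layers, n, d))

train, test = slice(0, 200), slice(200, 400)
layer_scores, ens = ensemble_scores(acts[:, train], y[train], acts[:, test])
single_aucs = [auroc(s, y[test]) for s in layer_scores]
ensemble_auc = auroc(ens, y[test])
```

Comparing `single_aucs` with `ensemble_auc` shows the intended effect in this toy setting: when each layer holds a noisy, differently oriented copy of the signal, averaging per-layer scores pools evidence that no single probe sees. Z-scoring before averaging keeps one layer with large-magnitude scores from dominating the ensemble.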
