Architecture Determines Observability of Transformers

Thomas Carmichael

arXiv:2604.24801 · cs.LG · May 13, 2026

TL;DR

This paper investigates how the architecture of autoregressive transformers determines their observability, showing that some architectures retain an activation signal that detects confident errors which output-confidence measures miss.

Contribution

The paper demonstrates that architecture determines the residual signal available for monitoring, and that signal engineering during training can improve error detection beyond confidence-based methods.

Findings

01

Activation monitors depend on architecture-specific signals.

02

Controlling for output confidence removes 60.3% of the raw activation-probe signal on average across 14 models.

03

Downstream QA with probes detects about 12.5% of confident errors at a 20% flag rate.
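The flag-rate figure in finding 03 can be read as a recall measurement: flag the top share of examples by monitor score, then count what fraction of the errors land in the flagged set. The excerpt does not define the metric, so this reading is an assumption, and the function name, arguments, and toy data below are hypothetical illustrations rather than the paper's evaluation code.

```python
import numpy as np

def recall_at_flag_rate(scores, is_error, flag_rate=0.2):
    """Share of all errors found among the top `flag_rate` fraction of
    examples, ranked by monitor score (higher = more suspicious).
    Assumed definition; the paper's exact metric may differ."""
    scores = np.asarray(scores, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    n_flag = int(round(flag_rate * len(scores)))
    if n_flag == 0 or is_error.sum() == 0:
        return 0.0
    # Stable sort on negated scores: highest scores come first.
    flagged = np.argsort(-scores, kind="stable")[:n_flag]
    return float(is_error[flagged].sum() / is_error.sum())

# Toy data (hypothetical): 10 examples, errors at positions 0 and 2.
toy_scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]
toy_errors = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_recall = recall_at_flag_rate(toy_scores, toy_errors, flag_rate=0.2)
```

With a 20% flag rate on ten examples, two are flagged (positions 0 and 1), so one of the two errors is caught.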

Abstract

Autoregressive transformers make confident errors that output-confidence monitoring cannot catch. Activation monitors catch them only when training leaves a decision-quality signal beyond what the output already exposes. This signal is an architectural property of the trained model, fixed upstream of any monitor. Controlling for output confidence removes 60.3% of the raw activation-probe signal on average across 14 models. Raw probe signal is mostly output confidence, and output-side readouts cannot recover the residual. What remains depends on architecture and training. In Pythia's controlled training, both matched-width configurations form the signal early. One preserves it through convergence while another erases it as perplexity continues to improve. Capability and observability are not inherently in tension. Across independently trained families this pattern persists, even as the…
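To make "controlling for output confidence" concrete, one common approach is a partial correlation: regress both the probe score and the correctness label on output confidence, then correlate the residuals. The excerpt does not state the procedure behind the 60.3% figure, so the sketch below is an assumption rather than the author's method. The synthetic data, the `hidden` variable, and all function names are hypothetical.

```python
import numpy as np

def residualize(y, x):
    """Subtract the least-squares linear fit of y on x (with intercept)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def raw_and_partial_corr(probe, correct, confidence):
    """Correlation of probe score with correctness, before and after
    linearly removing output confidence from both variables."""
    raw = np.corrcoef(probe, correct)[0, 1]
    partial = np.corrcoef(
        residualize(probe, confidence),
        residualize(correct, confidence),
    )[0, 1]
    return raw, partial

# Synthetic illustration (not the paper's data): correctness depends on
# output confidence plus a hidden factor that only activations expose.
rng = np.random.default_rng(0)
n = 5000
confidence = rng.normal(size=n)
hidden = rng.normal(size=n)  # hypothetical decision-quality signal
correct = (confidence + hidden + rng.normal(size=n) > 0).astype(float)
probe = confidence + 0.5 * hidden + 0.5 * rng.normal(size=n)

raw, partial = raw_and_partial_corr(probe, correct, confidence)
removed_fraction = 1.0 - partial / raw  # share of raw signal explained by confidence
```

In this toy setup the partial correlation is smaller than the raw one but stays positive, mirroring the abstract's point that most of the raw probe signal is output confidence while a residual can remain.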
