Architecture Determines Observability of Transformers
Thomas Carmichael

TL;DR
This paper investigates how the architecture of autoregressive transformers influences their observability, showing that architecture and training determine whether activation monitors can detect confident errors that output-confidence measures miss.
Contribution
It demonstrates that architecture determines the residual signal for monitoring, and that signal engineering during training can improve error detection beyond confidence-based methods.
Findings
Activation monitors depend on architecture-specific signals.
Controlling for output confidence removes 60.3% of the raw activation-probe signal on average across 14 models.
Downstream QA with probes detects about 12.5% of confident errors at a 20% flag rate.
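To make the flag-rate figure above concrete, here is a minimal sketch of one plausible reading of it: flag the top fraction of examples by monitor score, then report the share of errors that land in the flagged set. The function name `recall_at_flag_rate` and its inputs are hypothetical; the paper's exact evaluation protocol is not given in this summary.

```python
import numpy as np

def recall_at_flag_rate(scores, is_error, flag_rate=0.2):
    """Fraction of errors caught when flagging the top `flag_rate` share of examples.

    Assumption: higher score means the monitor considers the example more suspicious.
    """
    scores = np.asarray(scores, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    n_flag = int(round(flag_rate * len(scores)))  # size of the flagging budget
    if n_flag == 0 or not is_error.any():
        return 0.0
    # Indices of the n_flag highest-scoring examples.
    flagged = np.argsort(-scores, kind="stable")[:n_flag]
    return float(is_error[flagged].sum() / is_error.sum())
```

Under this reading, a flag rate fixes the review budget, and the reported percentage is recall over the errors within that budget.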
Abstract
Autoregressive transformers make confident errors that output-confidence monitoring cannot catch. Activation monitors catch them only when training leaves a decision-quality signal beyond what the output already exposes. This signal is an architectural property of the trained model, fixed upstream of any monitor. Controlling for output confidence removes 60.3% of the raw activation-probe signal on average across 14 models. Raw probe signal is mostly output confidence, and output-side readouts cannot recover the residual. What remains depends on architecture and training. In Pythia's controlled training, both matched-width configurations form the signal early. One preserves it through convergence while another erases it as perplexity continues to improve. Capability and observability are not inherently in tension. Across independently trained families this pattern persists, even as the…
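The idea of "controlling for output confidence" can be illustrated with a partial correlation: regress output confidence out of both the probe score and the correctness label, then correlate what is left. This is a sketch under assumptions, not the paper's stated method; the helper names `residualize` and `signal_removed`, and the choice of linear residualization and Pearson correlation, are illustrative.

```python
import numpy as np

def residualize(x, control):
    """Return x with the best linear fit on `control` (plus intercept) subtracted."""
    design = np.column_stack([np.ones_like(control), control])
    beta, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ beta

def signal_removed(probe, confidence, correct):
    """Compare raw and confidence-controlled probe signal.

    Returns (raw correlation, partial correlation, fraction of signal removed).
    Assumption: "signal" is measured as Pearson correlation with correctness.
    """
    probe = np.asarray(probe, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    raw = np.corrcoef(probe, correct)[0, 1]
    partial = np.corrcoef(
        residualize(probe, confidence),
        residualize(correct, confidence),
    )[0, 1]
    return raw, partial, 1.0 - abs(partial) / abs(raw)
```

In this framing, a removed fraction near 1 would mean the probe mostly re-reads output confidence, while a smaller fraction would indicate a residual signal the output does not expose.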
