Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Ming Liu

TL;DR
This paper asks whether linearly decodable failure signals in medical LLMs can be used to correct the failures they predict. The signals are decodable, but fixed linear steering does not improve accuracy; the signals can still support reliability estimation.
Contribution
The study shows that fixed linear steering does not correct LLM failures even when the failure signal is linearly decodable, presents evidence that representational entanglement is the likely cause, and proposes using the signal for reliability estimation instead.
Findings
Failure signals (the Overthinking regime) are linearly decodable at 71.6% balanced accuracy.
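Fixed linear steering yields no accuracy change (Delta ~= 0) in any of 29 configurations, and the same null result holds on a second architecture (Qwen2.5-7B) and a second domain (MMLU-STEM).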
Decodable failure signals can be used for reliability estimation, not correction.
Abstract
Can linearly decodable failure signals in LLM hidden states be leveraged to correct those failures? We investigate this classification-correction gap via Overthinking (OT)--a stable behavioral regime (Jaccard >= 0.81, 94% inter-annotator agreement) in medical QA where models answer correctly under resampling yet fail in extended chain-of-thought. OT is linearly decodable at 71.6% balanced accuracy (p < 10^{-16}). Yet five families of fixed linear steering (29 configurations, n=1,273) all yield Delta ~= 0, with identical null results cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM). Three convergent lines of evidence suggest representational entanglement: the OT direction has 85-88% overlap with task-critical computation (specificity ratio <= 0.152); non-targeted shared-direction steering damages accuracy (-12.1pp); and LEACE concept erasure damages accuracy (-3.6pp, p=0.01),…
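To make the two operations in the abstract concrete, here is a minimal sketch of a linear probe and of fixed linear steering on synthetic vectors. Everything in it is an assumption for illustration: the data are random stand-ins for residual-stream activations, the difference-of-means direction is one common probe choice rather than the paper's stated method, and the function names are hypothetical. It shows only the mechanics and does not reproduce any reported number.

```python
import numpy as np

def diff_of_means_direction(H, y):
    """Unit vector from the class-0 mean to the class-1 mean (assumed probe choice)."""
    v = H[y == 1].mean(axis=0) - H[y == 0].mean(axis=0)
    return v / np.linalg.norm(v)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels in {0, 1}."""
    recalls = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]
    return float(np.mean(recalls))

def probe_predict(H, v, threshold):
    """Label a state 1 when its projection onto v exceeds the threshold."""
    return (H @ v > threshold).astype(int)

def steer(H, v, alpha):
    """Fixed linear steering: add the same vector alpha * v to every hidden state."""
    return H + alpha * v

# Synthetic stand-in for hidden states: label 1 shifts one coordinate by 1.5.
rng = np.random.default_rng(0)
n, d = 400, 32
y = rng.integers(0, 2, size=n)
signal = np.zeros(d)
signal[0] = 1.5
H = rng.normal(size=(n, d)) + np.outer(y, signal)

train, test = slice(0, 300), slice(300, None)
v = diff_of_means_direction(H[train], y[train])

# Threshold halfway between the two class means of the training projections.
proj_train = H[train] @ v
threshold = 0.5 * (proj_train[y[train] == 1].mean() + proj_train[y[train] == 0].mean())

acc = balanced_accuracy(y[test], probe_predict(H[test], v, threshold))

# Steering moves every state by exactly alpha along v, regardless of its label.
H_steered = steer(H[test], v, alpha=-2.0)
shift = (H_steered - H[test]) @ v
print(f"balanced accuracy: {acc:.3f}")
```

The point of the sketch is the asymmetry the paper studies: reading the label off the projection onto v is a separate matter from whether adding a fixed multiple of v changes downstream behavior, which this toy setup cannot test because no model is run on the steered states.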
