Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations

Sanjay Basu; Sadiq Y. Patel; Parth Sheth; Bhairavi Muralidharan; Namrata Elamaran; Aakriti Kinra; John Morgan; Rajaie Batniji

arXiv:2603.18353·cs.AI·March 20, 2026

Open Access

TL;DR

This study evaluates whether mechanistic interpretability methods can correct language model errors. It finds that although the models encode the relevant task knowledge internally, none of the methods tested reliably turns that knowledge into correct outputs.

Contribution

The paper systematically tests four mechanistic interpretability methods for error correction, revealing their limitations in translating internal knowledge into improved model outputs.

Findings

01

Linear probes on internal representations reach 98.2% AUROC, but the model's output sensitivity is only 45.1%.

02

Concept bottleneck steering corrects 20% of missed hazards but disrupts 53% of correct detections.

03

Current interpretability methods cannot reliably convert internal knowledge into correct outputs.
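
The trade-off in the second finding can be made concrete with a little arithmetic. The sketch below is an illustration only, not the authors' analysis: it assumes the reported 20% correction rate applies uniformly to missed hazards and the 53% disruption rate applies uniformly to correct detections, and it computes the baseline sensitivity below which steering would produce a net gain in detected hazards. The function names and the example counts are hypothetical.

```python
def net_steering_effect(n_detected, n_missed, correction_rate, disruption_rate):
    """Expected change in detected hazards after steering.

    n_detected / n_missed are baseline counts of hazards the model
    caught / missed (hypothetical inputs, not figures from the paper).
    """
    corrected = correction_rate * n_missed      # missed hazards now caught
    disrupted = disruption_rate * n_detected    # correct detections now lost
    return corrected - disrupted


def break_even_sensitivity(correction_rate, disruption_rate):
    """Baseline sensitivity s at which steering neither helps nor hurts.

    Net change per hazard is c*(1 - s) - d*s, which is zero at s = c / (c + d).
    """
    return correction_rate / (correction_rate + disruption_rate)


# Rates taken from the finding above; the 50/50 split is made up.
threshold = break_even_sensitivity(0.20, 0.53)
example_change = net_steering_effect(50, 50, 0.20, 0.53)
print(f"break-even baseline sensitivity: {threshold:.3f}")
print(f"net change with 50 detected / 50 missed: {example_change:+.1f}")
```

Under these assumptions, steering with these rates yields a net gain only if the unsteered model already misses most hazards (baseline sensitivity below roughly 27%); at higher baselines it loses more correct detections than it recovers.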

Abstract

Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can bridge this knowledge-action gap has not been systematically tested. We compared four mechanistic interpretability methods -- concept bottleneck steering (Steerling-8B), sparse autoencoder feature steering, logit lens with activation patching, and linear probing with truthfulness separator vector steering (Qwen 2.5 7B Instruct) -- for correcting false-negative triage errors using 400 physician-adjudicated clinical vignettes (144 hazards, 256 benign). Linear probes discriminated hazardous from benign cases with 98.2% AUROC, yet the model's output sensitivity was only 45.1%, a 53-percentage-point knowledge-action gap. Concept bottleneck steering corrected 20% of missed hazards but disrupted 53% of correct detections,…
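
To make the two quantities in the knowledge-action gap concrete, here is a minimal sketch of how a probe's AUROC and a model's output sensitivity can be computed side by side. It is not the authors' pipeline: the probe scores are synthetic Gaussian draws, and the count of 65 flagged hazards is chosen only because 65 of 144 rounds to the 45.1% quoted above. Only the 144/256 class split comes from the abstract.

```python
import random


def auroc(scores, labels):
    """Probability that a random positive scores above a random negative
    (ties count as half). Pairwise form, fine for a few hundred cases."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity(predictions, labels):
    """Fraction of true hazards (label 1) that the output flagged (prediction 1)."""
    on_hazards = [p for p, y in zip(predictions, labels) if y == 1]
    return sum(on_hazards) / len(on_hazards)


random.seed(0)
labels = [1] * 144 + [0] * 256                  # class split from the abstract

# Synthetic stand-in for linear-probe scores: hazards shifted upward.
probe_scores = [random.gauss(3.0 if y else 0.0, 1.0) for y in labels]

# Synthetic stand-in for the model's text output: 65 of 144 hazards flagged.
model_flags = [1] * 65 + [0] * 79 + [0] * 256

probe_auroc = auroc(probe_scores, labels)
output_sens = sensitivity(model_flags, labels)
print(f"probe AUROC:        {probe_auroc:.3f}")
print(f"output sensitivity: {output_sens:.3f}")
print(f"gap (points):       {100 * (probe_auroc - output_sens):.1f}")
```

AUROC measures how well a score ranks hazards above benign cases, while sensitivity measures how many hazards the final output actually flags, so the gap compares what the representation could support with what the model does.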

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Clinical Reasoning and Diagnostic Skills