Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Mohammed Suhail B Nadaf

TL;DR
Function vectors can steer model behavior even when the logit lens cannot decode the correct answer at any intermediate layer, challenging the assumption that a task-relevant linear direction is both steerable and readable along the unembedding.
Contribution
The paper shows that steerability and decodability are distinct properties: FV steering succeeds where logit-lens decoding fails, with implications for model interpretability and safety.
Findings
FV steering often succeeds without decodable answers at intermediate layers.
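Steering and decoding dissociate asymmetrically: steerable-but-not-decodable cases are common, while decodable-but-not-steerable cases are nearly absent (3 of 72), contradicting the prediction that the two succeed or fail together.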
FV-based interventions can alter model behavior without leaving traceable logit-lens signals.
Abstract
Activation steering presupposes that task-relevant behaviors correspond to linear directions in activation space -- directions that should both steer the model and be readable along the unembedding. Function vectors (FVs), extracted as mean differences across ICL demonstrations, are the canonical test case; the prediction: steering and decoding succeed or fail together. Across 12 tasks, 6 models from 3 families, and 4,032 directed cross-template pairs, we find the opposite. FV steering routinely succeeds where the logit lens cannot decode the correct answer at any intermediate layer, while the converse -- decodable without steerable -- is nearly empty (3 of 72). The gap is not representational dialect. A diagonal tuned lens closes 1 of 14 steerable-not-decodable cases; a 2-layer MLP probe with a Hewitt & Liang control closes 5 of 10 via nonlinearly encoded structure but leaves 5…
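The three operations the abstract contrasts (extracting an FV as a mean difference, reading a vector through the logit lens, and adding the FV to a hidden state) can be sketched on toy arrays. This is a minimal illustration and not the paper's pipeline: the function names, the random `W_U`, the shapes, and the scale `alpha` are all assumptions made for the example. It also omits the final layer norm that logit-lens implementations usually apply before the unembedding, and it cannot show steering success, which requires running the rest of a real model on the steered state.

```python
import numpy as np

def extract_function_vector(h_icl, h_base):
    """Mean-difference direction: mean ICL activation minus mean baseline activation.

    Both inputs have shape (n_prompts, d_model). Hypothetical helper for illustration.
    """
    return h_icl.mean(axis=0) - h_base.mean(axis=0)

def logit_lens_top_token(h, W_U):
    """Project a hidden state directly through the unembedding and return the argmax token id.

    Simplification: no final layer norm is applied before the projection.
    """
    return int(np.argmax(h @ W_U))

def steer(h, fv, alpha=1.0):
    """Add the scaled function vector to a hidden state."""
    return h + alpha * fv

# Toy data (assumed shapes; not taken from any real model).
rng = np.random.default_rng(0)
d_model, vocab = 16, 50
W_U = rng.normal(size=(d_model, vocab))      # stand-in unembedding matrix
h_base = rng.normal(size=(32, d_model))      # activations without ICL demonstrations
shift = rng.normal(size=d_model)             # synthetic "task" offset
h_icl = h_base + shift                       # activations with ICL demonstrations

fv = extract_function_vector(h_icl, h_base)  # recovers the synthetic offset
h = rng.normal(size=d_model)                 # hidden state of a new prompt
h_steered = steer(h, fv, alpha=4.0)

# Decodability asks whether the lens reads the right token off the vector;
# steerability asks whether the model's eventual output changes after the edit.
decoded_from_fv = logit_lens_top_token(fv, W_U)
decoded_from_steered = logit_lens_top_token(h_steered, W_U)
```

In this framing, a steerable-not-decodable case is one where adding `fv` changes the model's final answer to the correct one, yet `logit_lens_top_token` applied at intermediate layers never returns that answer.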
