Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
Yifan Zhou

TL;DR
This paper investigates how instruction tuning changes the way earlier layers and the late stack cooperate. In instruction-tuned models, the late stack's effect on next-token margins depends on the model's own earlier-layer state, which has implications for interpretability and model control.
Contribution
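Introduces first-divergence cross-patching, a diagnostic that crosses each model's earlier-layer state with each model's late stack at the first token where the base and instruction-tuned checkpoints disagree, and uses it to show how instruction tuning alters cooperation between upstream layers and the late stack.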
Findings
In instruction-tuned models, late-stack effects depend on the model's own upstream state.
Cross-patching reveals late layers' reliance on upstream states for behavior.
Sparse final-layer features mediate the influence of upstream patches.
Abstract
Recent interpretability work has identified model-internal handles on post-trained behavior, including refusal directions, assistant/persona axes, and sparse chat-tuning features. These results localize where behaviors can be read out or controlled, often in middle-to-late layers. We ask how earlier computation and the late stack cooperate to turn those differences into next-token margins. To test this, we introduce first-divergence cross-patching: at the first token where pretrained base (PT) and instruction-tuned (IT) checkpoints disagree, we cross each model's earlier-layer state with each model's late stack. The diagnostic separates training recipes: same-base instruction-following descendants show late effects that depend on their own earlier-layer state, while OpenMath2 math-domain SFT and controlled code/biomed CPT controls with verified domain learning do not; for OpenMath2, the…
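The 2x2 design described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the excerpt does not give the exact metric, so the interaction score below (a difference of differences over the four margins) is an assumption, and the names `first_divergence`, `cross_patch_margins`, `interaction`, and the toy upstream/late functions are all hypothetical stand-ins for real model halves.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Grid = Dict[Tuple[str, str], float]


def first_divergence(pt_tokens: Sequence[int], it_tokens: Sequence[int]) -> int:
    """Return the first index where the two token sequences differ, or -1."""
    for i, (a, b) in enumerate(zip(pt_tokens, it_tokens)):
        if a != b:
            return i
    return -1


def cross_patch_margins(
    upstream: Dict[str, Callable[[Vector], Vector]],
    late: Dict[str, Callable[[Vector], Vector]],
    x: Vector,
    tok_it: int,
    tok_pt: int,
) -> Grid:
    """Pair every upstream state with every late stack.

    Each cell holds the margin logit[tok_it] - logit[tok_pt], keyed by
    (upstream model name, late-stack model name).
    """
    grid: Grid = {}
    for u_name, u_fn in upstream.items():
        h = u_fn(x)  # earlier-layer state produced by model u_name
        for l_name, l_fn in late.items():
            logits = l_fn(h)  # late stack of model l_name reads that state
            grid[(u_name, l_name)] = logits[tok_it] - logits[tok_pt]
    return grid


def interaction(grid: Grid) -> float:
    """Assumed score: how much the late-stack swap effect depends on the upstream state."""
    late_effect_on_it_state = grid[("IT", "IT")] - grid[("IT", "PT")]
    late_effect_on_pt_state = grid[("PT", "IT")] - grid[("PT", "PT")]
    return late_effect_on_it_state - late_effect_on_pt_state


def toy_demo() -> Tuple[Grid, float]:
    # Hypothetical two-dimensional "models"; token 0 is the PT choice, token 1 the IT choice.
    upstream = {
        "PT": lambda x: [x[0], x[1]],
        "IT": lambda x: [x[0], x[1] + 1.0],
    }
    late = {
        "PT": lambda h: [h[0], h[1]],
        "IT": lambda h: [h[0], h[1] + 2.0 * h[0] * h[1]],
    }
    grid = cross_patch_margins(upstream, late, [1.0, 0.0], tok_it=1, tok_pt=0)
    return grid, interaction(grid)
```

In this toy setup the IT late stack changes the margin only when it reads the IT upstream state, so the interaction score is nonzero. A score near zero would instead indicate that the late-stack effect is the same regardless of which model produced the earlier-layer state.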
