Simple Mechanistic Explanations for Out-Of-Context Reasoning
Atticus Wang, Joshua Engels, Oliver Clive-Griffin, Senthooran Rajamanoharan, Neel Nanda

TL;DR
This paper shows that out-of-context reasoning (OOCR) in fine-tuned LLMs can often be attributed to a simple mechanism: fine-tuning effectively adds a constant steering vector, which accounts for the models' surprising generalization.
Contribution
It demonstrates that many instances of OOCR in the literature arise because LoRA fine-tuning essentially adds a constant steering vector, which gives the phenomenon a mechanistic explanation.
Findings
Steering vectors induce OOCR in fine-tuned models
Steering vectors trained directly from scratch also induce OOCR
Unconditional steering explains behavior previously thought to require conditional logic
Abstract
Out-of-context reasoning (OOCR) is a phenomenon in which fine-tuned LLMs exhibit surprisingly deep out-of-distribution generalization. Rather than learning shallow heuristics, they implicitly internalize and act on the consequences of observations scattered throughout the fine-tuning data. In this work, we investigate this phenomenon mechanistically and find that many instances of OOCR in the literature have a simple explanation: the LoRA fine-tuning essentially adds a constant steering vector, steering the model towards a general concept. This improves performance on the fine-tuning task and in many other concept-related domains, causing the surprising generalization. Moreover, we can directly train steering vectors for these tasks from scratch, which also induces OOCR. We find that our results hold even for a task that seems like it must involve conditional behavior (model backdoors);…
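The central claim, that a LoRA update can act like a constant steering vector, can be illustrated with a toy numerical sketch. This is not the paper's code, models, or evaluation: the rank-1 adapter, the synthetic activations, and the names `lora_delta`, `constancy`, and `approx_err` are all assumptions made for illustration. The sketch builds activations that share a large common component, applies a rank-1 adapter that reads that component, and checks how well a single fixed vector approximates the adapter's output on every input.

```python
import numpy as np

def lora_delta(x, A, B):
    """Output added by a LoRA adapter, (B @ A) @ x, for each row of x."""
    return x @ A.T @ B.T

def constancy(deltas):
    """Mean cosine similarity between each per-input delta and the mean delta."""
    mean = deltas.mean(axis=0)
    num = deltas @ mean
    den = np.linalg.norm(deltas, axis=1) * np.linalg.norm(mean)
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
d, n = 64, 200  # hypothetical hidden size and number of inputs

# Toy activations (assumption): a large shared component plus small noise.
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)
x = 10.0 * shared + 0.1 * rng.normal(size=(n, d))

# Hypothetical rank-1 adapter: A reads the shared direction, B writes along b.
A = shared[None, :]               # shape (1, d)
B = rng.normal(size=d)[:, None]   # shape (d, 1)

deltas = lora_delta(x, A, B)      # shape (n, d), one delta per input
steer = deltas.mean(axis=0)       # candidate constant steering vector
score = constancy(deltas)

# Worst-case relative error when the adapter is replaced by `steer`.
approx_err = np.linalg.norm(deltas - steer, axis=1).max() / np.linalg.norm(steer)

print(f"constancy={score:.4f}  worst relative error={approx_err:.4f}")
```

In this toy setting the adapter's output barely depends on the input, so adding `steer` unconditionally reproduces it closely. Whether a real fine-tuned adapter behaves this way is an empirical question that the paper investigates; the sketch only shows what "essentially adds a constant steering vector" means operationally.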
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
