Reasoning-Finetuning Repurposes Latent Representations in Base Models
Jake Ward, Chuqiao Lin, Constantin Venhoff, Neel Nanda

TL;DR
This paper shows that reasoning fine-tuning induces backtracking at least in part by repurposing a latent representation already present in the base model, rather than learning the behavior from scratch.
Contribution
It identifies a direction in the residual stream of base Llama-3.1-8B that, when used to steer DeepSeek-R1-Distill-Llama-8B, induces backtracking, indicating that fine-tuning repurposed a pre-existing representation.
Findings
A direction in base model activations systematically induces backtracking when used to steer the distilled reasoning model.
Steering with this direction does not trigger backtracking in the base model.
This suggests that reasoning fine-tuning repurposes existing representations rather than creating new ones.
Abstract
Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models' enhanced capabilities. Prior work has succeeded in manipulating this behavior via steering vectors, but the underlying mechanism remains poorly understood. In this work, we show that the emergence of backtracking in DeepSeek-R1-Distill-Llama-8B is in part driven by a repurposed direction already present in base model activations. Specifically, we identify a direction in base Llama-3.1-8B's residual stream which systematically induces backtracking when used to steer the distilled reasoning model, and find that the effects of steering with this direction cannot be trivially explained by token-level attributes. We further find that this direction does not induce backtracking in the base model, suggesting that the reasoning finetuning process repurposes…
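As a rough illustration of what "steering with a direction" means, the sketch below derives a unit-norm direction from two sets of activations and adds a scaled copy of it to a residual-stream array. Everything here is an assumption made for illustration: the difference-of-means construction, the function names `difference_of_means` and `steer`, the scale `alpha`, and the random toy data are hypothetical and are not taken from the paper, whose extraction procedure is not described in this summary.

```python
import numpy as np


def difference_of_means(pos_acts, neg_acts):
    """Return a unit-norm direction: mean(pos_acts) - mean(neg_acts).

    Both inputs have shape (n_examples, d_model). This is one common way to
    build a steering direction; it is an assumption, not the paper's method.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer(resid, direction, alpha):
    """Add alpha * direction to every position of a (seq_len, d_model) array."""
    return resid + alpha * direction


# Toy stand-ins for real activations (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[3] = 1.0  # the "concept" lives along coordinate 3 in this toy setup

pos = rng.normal(size=(200, d_model)) + 2.0 * planted  # examples with the concept
neg = rng.normal(size=(200, d_model))                  # examples without it

direction = difference_of_means(pos, neg)

resid = rng.normal(size=(5, d_model))        # pretend residual stream, 5 tokens
steered = steer(resid, direction, alpha=4.0)

# Steering shifts each position's projection onto the direction by alpha.
projection_shift = (steered - resid) @ direction
```

In a real experiment the direction would be computed from one model's activations (here, the base model) and added during the forward pass of another (here, the distilled reasoning model) at a chosen layer; the layer and scale are choices this sketch does not attempt to reproduce.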
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling
