Transcoder Adapters for Reasoning-Model Diffing
Nathan Hu, Jake Ward, Thomas Icard, Christopher Potts

TL;DR
This paper introduces transcoder adapters to interpret and analyze the internal changes in reasoning models after fine-tuning, revealing sparse, interpretable features responsible for reasoning behaviors like hesitation tokens.
Contribution
We propose transcoder adapters, a method for learning an interpretable approximation of the difference in MLP computation before and after reasoning fine-tuning.
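To make the idea concrete, here is a minimal sketch of one way such an adapter could be structured. It rests on assumptions not confirmed by this summary: the adapter is taken to be a sparse ReLU encoder/decoder whose output is added to the frozen base MLP's output, trained to match the fine-tuned MLP's output with an L1 sparsity penalty. All function names, shapes, and numbers below are hypothetical and chosen only for illustration; the paper's actual architecture and objective may differ.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(m, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def base_mlp(x, w_in, w_out):
    """Toy stand-in for the frozen base model's MLP (assumed two-layer ReLU)."""
    return matvec(w_out, relu(matvec(w_in, x)))


def adapter_features(x, w_enc, b_enc):
    """Assumed adapter encoder: non-negative feature activations."""
    pre = matvec(w_enc, x)
    return relu([p + b for p, b in zip(pre, b_enc)])


def adapted_mlp(x, w_in, w_out, w_enc, b_enc, w_dec):
    """Base MLP output plus the adapter's decoded correction (assumed additive)."""
    feats = adapter_features(x, w_enc, b_enc)
    delta = matvec(w_dec, feats)
    base = base_mlp(x, w_in, w_out)
    return [b + d for b, d in zip(base, delta)], feats


def adapter_loss(output, feats, target, l1_coeff):
    """Assumed objective: MSE to the fine-tuned MLP output plus an L1 penalty.

    Features are non-negative after ReLU, so their sum equals their L1 norm.
    """
    mse = sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)
    return mse + l1_coeff * sum(feats)


if __name__ == "__main__":
    # Hypothetical toy weights: 2-dimensional residual stream, 3 adapter features.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    w_enc = [[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    b_enc = [0.0, 0.0, 0.0]
    w_dec = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.25]]
    out, feats = adapted_mlp([1.0, -2.0], identity, identity, w_enc, b_enc, w_dec)
    print("adapted output:", out)
    print("active features:", [i for i, f in enumerate(feats) if f > 0])
```

Under this additive framing, each adapter feature corresponds to one direction written into the MLP output, so the few features active on a given token can be inspected one at a time.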
Findings
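Learned adapters are faithful to the target model's internal computation and next-token predictions.
On reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning.
Only about 8% of adapter features have activating examples directly related to reasoning behaviors.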
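The "recovered accuracy gain" figure can be read as a simple ratio. The sketch below assumes the natural definition (the adapter model's improvement over the base model, divided by the reasoning model's improvement over the base model). This summary does not state the exact formula, and the function name and accuracies in the example are hypothetical, not results from the paper.

```python
def fraction_of_gain_recovered(base_acc, adapter_acc, reasoning_acc):
    """Share of the fine-tuning accuracy gain reproduced by the adapter model.

    Assumed definition: (adapter - base) / (reasoning - base).
    """
    gain = reasoning_acc - base_acc
    if gain == 0:
        raise ValueError("reasoning model shows no gain over the base model")
    return (adapter_acc - base_acc) / gain


if __name__ == "__main__":
    # Hypothetical accuracies, for illustration only.
    recovered = fraction_of_gain_recovered(0.40, 0.70, 0.80)
    print(f"recovered fraction of gain: {recovered:.2f}")
```

A value of 1.0 would mean the base model plus adapters matches the reasoning model's accuracy, and 0.0 would mean no improvement over the base model.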
Abstract
While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling
