Transcoder Adapters for Reasoning-Model Diffing

Nathan Hu; Jake Ward; Thomas Icard; Christopher Potts

arXiv:2602.20904·cs.LG·February 25, 2026

Open Access

TL;DR

This paper introduces transcoder adapters to interpret and analyze the internal changes in reasoning models after fine-tuning, revealing sparse, interpretable features responsible for reasoning behaviors like hesitation tokens.

Contribution

We propose transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after reasoning fine-tuning.

Findings

01

Learned adapters are faithful to the target model's internal computation and next-token predictions.

02

On reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning.

03

Only about 8% of adapter features are related to reasoning behaviors.
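Finding 02 is easiest to read as a ratio. The sketch below assumes that "accuracy gain recovered" means the adapter-equipped base model's improvement over the base model, divided by the reasoning model's improvement over the base model. This listing does not state the paper's exact definition, so the function name and formula here are an illustrative assumption, and the numbers are made up.

```python
def fraction_of_gain_recovered(base_acc: float, adapter_acc: float, target_acc: float) -> float:
    """Share of the base-to-target accuracy gain reached by the base model plus adapter.

    Assumed definition (not confirmed by the listing):
        (adapter_acc - base_acc) / (target_acc - base_acc)
    """
    gain = target_acc - base_acc
    if gain <= 0:
        raise ValueError("target model must outperform the base model")
    return (adapter_acc - base_acc) / gain


# Hypothetical accuracies for illustration only; these are not results from the paper.
example = fraction_of_gain_recovered(base_acc=0.40, adapter_acc=0.70, target_acc=0.80)
print(f"fraction of gain recovered: {example:.2f}")
```

Under this reading, a value of 0.5 to 0.9 corresponds to the 50-90% range quoted above.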

Abstract

While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related…
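The abstract describes a transcoder adapter as an interpretable approximation of the difference in MLP computation before and after fine-tuning. A minimal sketch of one way such an objective could look is given below, in pure Python. It assumes a ReLU encoder that produces sparse features, a linear decoder whose output is added to the frozen base model's MLP output, and a mean-squared-error loss with an L1 sparsity penalty. All of these choices, along with the names `w_enc`, `b_enc`, `w_dec`, and `l1_coeff`, are assumptions for illustration; the paper's actual architecture and loss may differ.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapter_forward(x, w_enc, b_enc, w_dec):
    """Return (features, contribution) for one MLP input x.

    features     = ReLU(w_enc @ x + b_enc)   # sparse, one entry per adapter feature
    contribution = w_dec @ features          # added to the base MLP output
    """
    pre_activations = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre_activations]
    contribution = matvec(w_dec, features)
    return features, contribution


def adapter_loss(x, base_mlp_out, target_mlp_out, w_enc, b_enc, w_dec, l1_coeff=1e-3):
    """Assumed objective: MSE between (base MLP output + adapter) and the
    fine-tuned model's MLP output, plus an L1 penalty on feature activations."""
    features, contribution = adapter_forward(x, w_enc, b_enc, w_dec)
    approx = [b + c for b, c in zip(base_mlp_out, contribution)]
    mse = sum((a - t) ** 2 for a, t in zip(approx, target_mlp_out)) / len(approx)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Toy example: 2-dimensional residual stream, 3 adapter features (hypothetical values).
x = [1.0, -2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[0.5, 0.0, 0.0], [0.0, 0.0, -0.5]]
base_out = [1.0, 1.0]      # stand-in for the base model's MLP output
target_out = [1.5, 0.5]    # stand-in for the fine-tuned model's MLP output
loss = adapter_loss(x, base_out, target_out, w_enc, b_enc, w_dec)
```

In this toy setup the adapter exactly bridges the gap between the two MLP outputs, so the remaining loss comes only from the sparsity penalty on the two active features.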

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling