Dynamic Weight Grafting: Localizing Finetuned Factual Knowledge in Transformers
Todd Nief, David Reber, Sean Richardson, Ari Holtzman

TL;DR
This paper introduces dynamic weight grafting, a new interpretability method that localizes where and how finetuned factual knowledge is stored and retrieved in transformer models, revealing multiple pathways for knowledge access.
Contribution
It proposes dynamic weight grafting to analyze finetuned knowledge localization, identifying enrichment and recall pathways and their mechanisms within transformer models.
Findings
Knowledge is stored via enrichment and recall pathways.
Recall involves task-specific attention and entity extraction.
Model components responsible for knowledge retrieval are localized.
Abstract
When an LLM learns a new fact during finetuning (e.g., new movie releases, newly elected pope, etc.), where does this information go? Are entities enriched with relation information immediately, or do models recall information just-in-time before a prediction? Or, are "all of the above" true, with LLMs implementing multiple redundant heuristics? Existing localization approaches (e.g., activation patching) are ill-suited for this analysis because they usually replace parts of the residual stream, thus overriding previous information. To fill this interpretability gap, we propose dynamic weight grafting, an analysis technique that selectively grafts subsets of weights from a finetuned model onto a pretrained model. Using this technique, we show two separate pathways for retrieving finetuned relation information: 1) "enriching" the residual stream with relation information while processing…
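The core idea in the abstract, choosing per token position which weights come from the finetuned model and which stay pretrained, can be sketched on a toy model. This is only an illustrative sketch under strong assumptions, not the authors' implementation: the names `graft` and `forward`, the component labels `"ffn"` and `"attn"`, and the use of one scalar per component are all hypothetical stand-ins for real weight matrices.

```python
from typing import Dict, List, Sequence, Set

Weights = Dict[str, float]  # toy stand-in: one scalar per named component (assumption)


def graft(pretrained: Weights, finetuned: Weights, components: Set[str]) -> Weights:
    """Copy the pretrained weights, swapping in the finetuned value for each named component."""
    unknown = components - set(pretrained)
    if unknown:
        raise KeyError(f"unknown components: {sorted(unknown)}")
    return {name: (finetuned[name] if name in components else value)
            for name, value in pretrained.items()}


def forward(tokens: Sequence[float], pretrained: Weights, finetuned: Weights,
            grafts_by_position: Dict[int, Set[str]]) -> List[float]:
    """Toy two-layer causal model whose weights are chosen separately at each position."""
    # Build one weight set per position; positions without an entry stay fully pretrained.
    per_pos = [graft(pretrained, finetuned, grafts_by_position.get(i, set()))
               for i in range(len(tokens))]
    # Layer 1: position-wise residual update (hypothetical stand-in for an FFN).
    hidden = [x + w["ffn"] * x for x, w in zip(tokens, per_pos)]
    # Layer 2: causal averaging over earlier positions (hypothetical stand-in for attention).
    outputs = []
    for i, w in enumerate(per_pos):
        context = sum(hidden[: i + 1]) / (i + 1)
        outputs.append(hidden[i] + w["attn"] * context)
    return outputs
```

In this toy, a call such as `forward(tokens, pretrained, finetuned, {0: {"ffn"}})` grafts finetuned weights only at the first position, while `{len(tokens) - 1: {"attn"}}` grafts only at the last. Because the weights are swapped rather than the activations, information computed at earlier positions still flows forward unmodified, which is the property the abstract contrasts with activation patching.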
Peer Reviews
Decision: ICLR 2026 Poster
- Novel methodology: DWG is an original and potentially useful interpretability technique that avoids the destructive limitations of activation patching.
- Comprehensive experiments: Multiple models and datasets are tested, showing consistency of findings.
- Clear conceptual framing: The paper offers an intuitive distinction between “enrichment” and “recall” processes in LLM memory retrieval.
- Potential for broader application: The approach could generalize to other interpretability or knowledg

- Synthetic and simplistic setting: The experiments rely heavily on artificial datasets (e.g., fake movies and actors), which limits external validity.
- Limited theoretical insight: The method identifies where retrieval occurs but not why or how specific mechanisms encode relations.
- Insufficient analysis of failures: Cases where grafting fails (e.g., certain models or directions) are mentioned but not deeply analyzed.

- The paper proposes an innovative way to probe how relational knowledge is retrieved from parameters without directly overriding the residual stream, as activation patching does.
- Datasets and experiments are tightly controlled, yielding strong, clean evidence that enrichment at the first token and recall at the last token jointly recover fine-tuning performance.
- Component grafting reveals insights into how late-layer O-projection + FFN at the final token and task-specific attention at th

- It is not clear whether similar mechanisms generalize to real-world data or whether the findings still hold as model size scales.
- The study is conducted on a one-hop relational task; findings for position and component grafting may not transfer to multi-hop settings. The claims should be tempered to match the scope of the task.

**Novel and Sound Methodology**: The core contribution, "dynamic weight grafting," is a clever and valuable addition to the interpretability toolkit. It correctly identifies a key flaw in standard activation patching (conflating information computation with information passing) and proposes a more precise causal intervention by swapping the mechanisms (weights) themselves.

**Clear, Rigorous Findings**: The identification of the 'E' and 'R' pathways is a clear and compelling finding. The authors

## Experimental Concerns

1. **Unexplained Model-Specific Strategies**: A significant concern is the inconsistent behavior across models, which the paper notes but fails to adequately explain. In Figure 2, the performance of LT (Last Token) versus FE^C (which includes LT) shows large, unexplained disparities in models like GPT2-XL. More importantly, the paper notes that models like Gemma/Llama favor a strong 'R' path, while GPT2-XL/Pythia favor an 'E' path. For an interpretability paper, why the
Taxonomy
Topics: Semantic Web and Ontologies
