Mechanistic Interpretability of GPT-like Models on Summarization Tasks
Anurag Mishra

TL;DR
This paper develops a mechanistic interpretability framework for GPT-like models on summarization tasks, identifying key layers and attention heads involved in the process, and improving performance with targeted LoRA adaptation.
Contribution
It introduces a novel interpretability approach for summarization, locating the 'summarization circuit' within GPT-like models and enhancing fine-tuning efficiency.
Findings
Middle layers (particularly 2, 3, and 5) change the most between the pre-trained and fine-tuned models.
62% of attention heads show decreased attention entropy after fine-tuning, indicating a shift toward focused information selection.
Targeted LoRA adaptation outperforms standard fine-tuning with fewer epochs.
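The entropy finding above can be illustrated with a minimal sketch. This summary does not say exactly how the paper computes attention entropy, so the code assumes Shannon entropy of each attention row, averaged over query positions. The function names, the array layout (layers, heads, query_len, key_len), and the toy inputs are illustrative assumptions, not the author's implementation.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of each head's attention distributions.

    attn: array of shape (layers, heads, query_len, key_len) whose last
    axis sums to 1 (assumed layout, not taken from the paper).
    Returns an array of shape (layers, heads).
    """
    attn = np.asarray(attn, dtype=np.float64)
    # Entropy of each query's distribution over keys; eps avoids log(0).
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)
    # Average over query positions to get one value per head.
    return row_entropy.mean(axis=-1)


def fraction_heads_decreased(attn_pre, attn_ft):
    """Fraction of heads whose mean entropy is lower after fine-tuning.

    Returns (fraction, per-head entropy change of shape (layers, heads)).
    """
    delta = attention_entropy(attn_ft) - attention_entropy(attn_pre)
    return float(np.mean(delta < 0)), delta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for attention maps from a pre-trained and a fine-tuned model.
    logits_pre = rng.normal(size=(6, 8, 16, 16))
    logits_ft = 2.0 * rng.normal(size=(6, 8, 16, 16))

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    frac, delta = fraction_heads_decreased(softmax(logits_pre), softmax(logits_ft))
    print(f"heads with decreased entropy: {frac:.0%}")
```

A lower entropy means a head spreads its attention over fewer tokens, which is the sense in which the finding describes more focused information selection. In practice the attention maps would come from running both models on the same inputs.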
Abstract
Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits…
