Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Shahar Katz, Yonatan Belinkov, Mor Geva, Lior Wolf

TL;DR
This paper introduces a method for projecting language model gradients into the vocabulary space, giving insight into how models learn and store new information during training.
Contribution
It extends vocabulary-projection interpretability techniques from the forward pass to the backward pass, proving that a gradient matrix can be expressed as a low-rank linear combination of its forward-pass and backward-pass inputs.
Findings
A gradient matrix can be written as a low-rank linear combination of its forward-pass and backward-pass inputs.
Projecting gradients into vocabulary items helps explain how models encode new information.
The method reveals mechanisms by which new information is stored in the LMs' neurons.
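The low-rank finding can be illustrated on a single linear layer. The sketch below is not the paper's code: it assumes a toy layer Y = XW with a squared-error loss, and all names (X, W, T, delta) are hypothetical. It builds the weight gradient as a sum of outer products between each forward-pass input and the matching backward-pass vector, checks the result against finite differences, and confirms that the rank is at most the number of input positions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 2, 6, 5             # n input positions (toy sizes, assumed)
X = rng.normal(size=(n, d_in))       # forward-pass inputs to the layer
W = rng.normal(size=(d_in, d_out))   # layer weights
T = rng.normal(size=(n, d_out))      # arbitrary regression targets

def loss(weights):
    # Toy squared-error loss on the layer output Y = X @ weights.
    return 0.5 * np.sum((X @ weights - T) ** 2)

# Backward-pass input: the gradient of the loss w.r.t. the layer output.
delta = X @ W - T

# Gradient of W as a sum of n rank-1 outer products x_i (outer) delta_i.
grad_outer = sum(np.outer(X[i], delta[i]) for i in range(n))

# Reference gradient from central finite differences.
eps = 1e-6
grad_fd = np.zeros_like(W)
for a in range(d_in):
    for b in range(d_out):
        E = np.zeros_like(W)
        E[a, b] = eps
        grad_fd[a, b] = (loss(W + E) - loss(W - E)) / (2 * eps)

rank = np.linalg.matrix_rank(grad_outer)
print("matches finite differences:", np.allclose(grad_outer, grad_fd, atol=1e-5))
print("rank of gradient:", rank, "<= number of positions:", n)
```

Because the 6 x 5 gradient is a sum of only n = 2 outer products, its rank cannot exceed 2, whatever the layer width.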
Abstract
Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
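Projecting a vector into vocabulary items can be sketched in the style of forward-pass lens methods: multiply the vector by an unembedding matrix and read off the highest-scoring tokens. The example below is only an illustration under stated assumptions, not the paper's procedure. The four-word vocabulary, the orthonormal unembedding matrix U, and the backward-pass vector delta are all hypothetical, and it is assumed that delta lives in the same space as the model's output embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Paris", "London", "Rome", "Berlin"]  # toy vocabulary (assumed)
d_model = 4

# Hypothetical unembedding matrix: one row per token, rows made orthonormal
# via QR so the projection below is easy to read.
U, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))

# Hypothetical backward-pass vector, built to point mostly toward "Paris"
# and slightly toward "Rome".
delta = 2.0 * U[0] + 0.5 * U[2]

def project_to_vocab(vec, unembed, tokens, k=2):
    # Score every token by its dot product with the vector, keep the top k.
    scores = unembed @ vec
    top = np.argsort(scores)[::-1][:k]
    return [(tokens[i], float(scores[i])) for i in top]

top_tokens = project_to_vocab(delta, U, vocab)
print(top_tokens)
```

In a real model the unembedding rows are not orthonormal, so scores from different tokens overlap and the ranking is only a rough reading of what the vector encodes.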
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
