Backward Lens: Projecting Language Model Gradients into the Vocabulary Space

Shahar Katz; Yonatan Belinkov; Mor Geva; Lior Wolf

arXiv:2402.12865·cs.CL·February 21, 2024·1 cites

TL;DR

This paper introduces a method for projecting language model gradients into the vocabulary space, and uses it to examine how new information is stored in a model's neurons.

Contribution

The paper extends vocabulary-projection interpretability techniques from the forward pass to the backward pass. It proves that a gradient matrix can be written as a low-rank linear combination of its forward-pass and backward-pass inputs.

Findings

1. Gradient matrices can be represented as low-rank combinations of inputs.

2. Projected gradients help understand how models encode new information.

3. Method reveals mechanisms of information storage in neurons.

Abstract

Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
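
The low-rank claim in the abstract can be checked numerically on a single linear layer. The sketch below is not the authors' code. It assumes a layer of the form y = x @ W, random stand-ins for the forward-pass inputs (`X`) and the backward-pass inputs (`delta`), and a hypothetical random unembedding matrix `E`. Under those assumptions the weight gradient is a sum of one outer product per token, so its rank cannot exceed the number of tokens. The final projection step is only an analogy to forward-pass vocabulary projections. The paper's exact projection procedure is not described on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the paper).
n_tokens, d_in, d_out, vocab_size = 4, 16, 12, 50

# Forward-pass inputs to a linear layer y = x @ W, one row per token.
X = rng.standard_normal((n_tokens, d_in))
# Backward-pass inputs: gradient of the loss w.r.t. each token's layer output.
delta = rng.standard_normal((n_tokens, d_out))

# Chain rule for Y = X @ W gives dL/dW = X^T @ dL/dY.
grad_W = X.T @ delta

# The same matrix written as a sum of per-token outer products.
grad_sum = sum(np.outer(X[i], delta[i]) for i in range(n_tokens))

# Each outer product has rank 1, so the gradient has rank <= n_tokens.
rank = np.linalg.matrix_rank(grad_W)

# Hypothetical unembedding matrix mapping the layer's output space to vocabulary logits.
E = rng.standard_normal((d_out, vocab_size))
# Score every vocabulary item against each token's backward-pass vector.
vocab_scores = delta @ E
# Indices of the three highest-scoring vocabulary items per token.
top_tokens = np.argsort(-vocab_scores, axis=1)[:, :3]

print("gradient shape:", grad_W.shape, "rank:", rank)
print("top vocabulary indices per token:\n", top_tokens)
```

With a real model, `X` and `delta` would come from the recorded layer inputs and the backpropagated output gradients, and `E` would be the model's own output embedding. Here all three are random placeholders.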

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling