Jacobian Scopes: token-level causal attributions in LLMs
Toni J.B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Christopher J. Earls

TL;DR
Jacobian Scopes is a suite of gradient-based, token-level causal attribution methods for interpreting large language models. It quantifies how individual input tokens influence a prediction, and in case studies on instruction understanding, translation, and in-context learning it reveals implicit biases and underlying mechanisms.
Contribution
The paper presents Jacobian Scopes, a novel suite of gradient-based methods for token-level causal attribution in LLMs, grounded in perturbation theory and information geometry.
Findings
Reveals implicit political biases in LLM predictions.
Uncovers word- and phrase-level translation strategies.
Provides insights into mechanisms of in-context learning.
Abstract
Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover…
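To make the idea of a gradient-based, token-level attribution for a specific logit concrete, here is a minimal sketch on a toy model. It is not the paper's implementation: the toy network, the names `toy_logits`, `logit_jacobian`, and `token_scores`, all dimensions, and the choice of the L2 norm to collapse each token's gradient into one score are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H, V = 4, 3, 5, 6  # tokens, embedding dim, hidden dim, vocab size (hypothetical)

A = rng.normal(size=(T, H, D))  # per-position mixing weights of the toy model
U = rng.normal(size=(V, H))     # output projection to logits


def toy_logits(X):
    """Toy stand-in for an LLM: maps token embeddings X (T, D) to logits (V,)."""
    h = np.tanh(np.einsum("thd,td->h", A, X))
    return U @ h


def logit_jacobian(X, k):
    """Analytic gradient of logit k w.r.t. every input embedding, shape (T, D)."""
    h = np.tanh(np.einsum("thd,td->h", A, X))
    g = U[k] * (1.0 - h ** 2)            # d logit_k / d pre-activation, shape (H,)
    return np.einsum("h,thd->td", g, A)  # chain rule back to each token embedding


def token_scores(X, k):
    """One score per token: L2 norm of that token's gradient block (an assumption)."""
    return np.linalg.norm(logit_jacobian(X, k), axis=1)


X = rng.normal(size=(T, D))              # random "context" embeddings
target = int(np.argmax(toy_logits(X)))   # attribute the top predicted logit
scores = token_scores(X, target)
print("influence of each input token on logit", target, ":", scores)
```

A larger score means that a small perturbation of that token's embedding moves the chosen logit more. For a real LLM the same quantity would typically be obtained with automatic differentiation rather than a hand-derived gradient.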
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods · Multimodal Machine Learning Applications
