Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
Shixing Yu, Promit Ghosal, Kyra Gan

TL;DR
This paper presents a novel framework for attributing influence in large language models at the token level, using latent spaces and Jacobian methods to improve interpretability and trust in high-stakes applications.
Contribution
The paper introduces a flexible latent mediation approach that combines sparse autoencoders with Jacobian-vector products to attribute token-level influence in general prediction tasks.
Findings
Identifies sparse, interpretable token influences on predictions.
Enhances model transparency and trust in healthcare benchmarks.
Scales efficiently with inverse-Hessian approximations.
Abstract
A critical step for the reliable use of large language models (LLMs) in healthcare is to attribute predictions to their training data, akin to a medical case study. This requires token-level precision: pinpointing not just which training examples influence a decision, but which tokens within them are responsible. While influence functions offer a principled framework for this, prior work is restricted to autoregressive settings and relies on an implicit assumption of token independence, which renders the identified influences unreliable. We introduce a flexible framework that infers token-level influence through a latent mediation approach for general prediction tasks. Our method attaches sparse autoencoders to any layer of a pretrained LLM to learn a basis of approximately independent latent features. Unlike prior methods where influence decomposes additively across tokens, influence computed…
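For readers unfamiliar with the influence functions the abstract builds on, here is a minimal sketch of the classical quantity: the influence of a training point on a test loss, -grad_test^T H^{-1} grad_train, with the inverse-Hessian-vector product approximated rather than computed exactly. This is not the paper's latent-mediation method. The toy least-squares model, the damping term, the choice of a truncated Neumann series as the approximation, and every function name are assumptions made for illustration.

```python
# Illustrative sketch only: classical influence on a toy least-squares model.
# Model, damping, Neumann-series approximation and all names are assumptions,
# not taken from the paper.

def grad(theta, x, y):
    """Gradient of the loss 0.5 * (theta . x - y)^2 with respect to theta."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]

def hessian(xs, damping):
    """Mean Hessian of the loss over the inputs xs, plus damping * I."""
    n, d = len(xs), len(xs[0])
    H = [[sum(x[i] * x[j] for x in xs) / n for j in range(d)] for i in range(d)]
    for i in range(d):
        H[i][i] += damping
    return H

def matvec(H, v):
    """Matrix-vector product H v."""
    return [sum(hij * vj for hij, vj in zip(row, v)) for row in H]

def ihvp_neumann(H, v, scale, steps):
    """Approximate H^{-1} v by iterating h <- v + (I - H/scale) h.

    The iteration converges to (H/scale)^{-1} v when the eigenvalues of
    H/scale lie in (0, 2); dividing by scale then gives H^{-1} v.
    """
    h = list(v)
    for _ in range(steps):
        Hh = matvec(H, h)
        h = [vi + hi - Hhi / scale for vi, hi, Hhi in zip(v, h, Hh)]
    return [hi / scale for hi in h]

def influence(theta, H, train_pt, test_pt, scale=2.0, steps=300):
    """Classical influence: -grad_test^T H^{-1} grad_train."""
    g_train = grad(theta, *train_pt)
    g_test = grad(theta, *test_pt)
    s = ihvp_neumann(H, g_train, scale, steps)
    return -sum(a * b for a, b in zip(g_test, s))

# Toy usage: three 2-D training inputs and one held-out test point.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = hessian(xs, damping=0.01)
score = influence([0.5, 0.5], H, ([1.0, 1.0], 3.0), ([1.0, 0.0], 1.0))
```

In the classical setting `theta` would be the fitted parameters; here it is an arbitrary point chosen to keep the example short. The `scale` argument must exceed half the largest eigenvalue of `H` for the iteration to converge, which is why it is exposed instead of hidden.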
