TokenSHAP: Interpreting Large Language Models with Monte Carlo Shapley Value Estimation
Roni Goldshmidt, Miriam Horovicz

TL;DR
TokenSHAP is a Monte Carlo-based Shapley value method for interpreting large language models. It quantifies the importance of each input token, which improves transparency and aids prompt engineering.
Contribution
TokenSHAP adapts Shapley values to LLM prompts and estimates them with Monte Carlo sampling, giving an efficient, nuanced interpretation of model outputs and advancing explainability in NLP.
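For reference, the standard Shapley value from cooperative game theory, together with a common permutation-sampling Monte Carlo estimate of it, can be written as below. The notation is an assumption made for illustration and is not taken from the paper: $N$ is the set of $n$ tokens in the prompt, $v(S)$ is some score of the model's output when only the tokens in $S$ are kept, $\pi_1, \dots, \pi_M$ are uniformly random orderings of the tokens, and $P_i^{\pi}$ is the set of tokens that precede token $i$ in ordering $\pi$.

```latex
% Exact Shapley value of token i: weighted average of its marginal
% contribution over all subsets S that do not contain i.
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]

% Monte Carlo estimate from M random orderings of the tokens.
\hat{\phi}_i \;=\; \frac{1}{M} \sum_{m=1}^{M}
  \bigl[\, v\bigl(P_i^{\pi_m} \cup \{i\}\bigr) - v\bigl(P_i^{\pi_m}\bigr) \,\bigr]
```

The exact sum has $2^{n-1}$ terms per token, which is why sampling is attractive for prompts of more than a handful of tokens. The estimator is unbiased, and its error shrinks as $M$ grows.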
Findings
Outperforms existing baselines in alignment with human judgments
Provides consistent, faithful token importance measures
Enhances understanding of token interactions in LLMs
Abstract
As large language models (LLMs) become increasingly prevalent in critical applications, the need for interpretable AI has grown. We introduce TokenSHAP, a novel method for interpreting LLMs by attributing importance to individual tokens or substrings within input prompts. This approach adapts Shapley values from cooperative game theory to natural language processing, offering a rigorous framework for understanding how different parts of an input contribute to a model's response. TokenSHAP leverages Monte Carlo sampling for computational efficiency, providing interpretable, quantitative measures of token importance. We demonstrate its efficacy across diverse prompts and LLM architectures, showing consistent improvements over existing baselines in alignment with human judgments, faithfulness to model behavior, and consistency. Our method's ability to capture nuanced interactions between…
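The general idea of estimating token-level Shapley values by sampling can be sketched in a few lines of Python. This is a minimal illustration using generic permutation sampling, not the authors' implementation. The names `monte_carlo_shapley` and `toy_value` are hypothetical, and `toy_value` is a hand-written stand-in for whatever score a real setup would compute from an LLM's response to a partial prompt.

```python
import random
from typing import Callable, List, Sequence, Set


def _subset(tokens: Sequence[str], included: Set[int]) -> List[str]:
    """Return the included tokens in their original prompt order."""
    return [tokens[j] for j in sorted(included)]


def monte_carlo_shapley(
    tokens: Sequence[str],
    value_fn: Callable[[List[str]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate one Shapley value per token by permutation sampling.

    For each random ordering, tokens are added one at a time and each
    token is credited with the change in value_fn that its addition causes.
    """
    rng = random.Random(seed)
    n = len(tokens)
    totals = [0.0] * n
    order = list(range(n))
    for _ in range(num_permutations):
        rng.shuffle(order)
        included: Set[int] = set()
        prev = value_fn(_subset(tokens, included))
        for i in order:
            included.add(i)
            cur = value_fn(_subset(tokens, included))
            totals[i] += cur - prev  # marginal contribution of token i
            prev = cur
    return [t / num_permutations for t in totals]


def toy_value(kept: List[str]) -> float:
    """Hypothetical value function (assumption, not from the paper).

    Pretends the output is partly right when "France" is kept and fully
    right only when "capital" is kept as well, so the two tokens interact.
    """
    score = 0.0
    if "France" in kept:
        score += 0.4
        if "capital" in kept:
            score += 0.6
    return score


if __name__ == "__main__":
    prompt = "What is the capital of France".split()
    for token, phi in zip(prompt, monte_carlo_shapley(prompt, toy_value)):
        print(f"{token:>8}: {phi:.3f}")
```

In this toy setup the filler words receive a score of exactly zero, while "France" and "capital" share the credit for their joint effect. In practice `value_fn` would call a model, so the number of sampled orderings directly controls the number of model calls.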
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
