Explaining the Reasoning of Large Language Models Using Attribution Graphs
Chase Walker, Rickard Ewetz

TL;DR
This paper introduces CAGE, a novel graph-based framework that enhances the explanation of large language models' reasoning by capturing inter-generational influences, significantly improving attribution faithfulness.
Contribution
The paper proposes the CAGE framework, which constructs attribution graphs to better explain LLM reasoning by considering influences from all prior generations, addressing limitations of existing methods.
Findings
CAGE improves attribution faithfulness by up to 40%.
It effectively captures inter-generational influences in LLMs.
The method is validated across multiple models and datasets.
Abstract
Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two…
Peer Reviews
Decision · Submitted to ICLR 2026
1. Clear motivation and presentation. The paper articulates its motivation clearly and presents the research problem and solution in a well-structured and intuitive manner.
2. The empirical evaluations cover four models, three task datasets, five base attribution methods, and multiple evaluation metrics. CAGE achieves the best performance in 17 of 20 comparisons (85%) on the AC metric and in 40 of 40 comparisons on faithfulness, lending strong empirical support to its claims.
1. Limited technical novelty and theoretical depth. While the paper identifies clear shortcomings in row-wise attribution methods, the proposed CAGE framework primarily relies on standard graph-based computations (e.g., path accumulation operations). The approach appears straightforward. The authors could further clarify the technical or theoretical challenges involved in the research, and articulate any new conceptual insights it provides regarding the semantics of contextual attribution.
2. P
- Current "row attribution" methods treat each generated token's relationship to the prompt in isolation, which is fundamentally incompatible with the sequential, stateful nature of autoregressive generation, especially in chain-of-thought processes. The proposed CAGE framework is a novel and principled solution to this problem, utilizing a causal perspective on the generation process.
- The paper is well-written and easy to follow. The motivation is clearly laid out in the introduction, and the
- The model of influence propagation, calculated via matrix multiplication ($A_{\tau,:}^{\prime}=A_{\tau,:}+A_{\tau,\tau-1}\cdot A_{\tau-1,:}$), implicitly assumes that influence propagates through the network in a way that can be modeled by a linear combination of path weights. The paper could be strengthened by acknowledging this as a simplifying assumption and briefly discussing why it is a reasonable one in this context, or contemplating what might be lost by this linearization of the influe
* Captures inter-generational influence (not just prompt→token), aligning with CoT behavior.
* Principled construction (nonnegative, row-stochastic adjacency; DAG; closed-form total influence).
* Consistent empirical gains (AC ↑ max/avg 134%/40%; faithfulness wins 40/40; up to 30%/11% improvement).
* Method-agnostic: wraps perturbation, CLP, IG, Attn×IG, ReAGent.
* Because CAGE applies Φ(x)=max(x,0) and then row-normalizes the attribution table into a stochastic adjacency, it discards inhibitory (negative) effects and collapses absolute magnitudes into relative shares that must sum to one, so negative influence and true effect size cannot be represented.
* The faithfulness tests remove entire prompt sentences and replace them with EOS tokens, which can introduce distribution shift and confound measured effects with artifacts of degraded input rather tha
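The construction the reviews describe (clip with Φ(x)=max(x,0), row-normalize into a stochastic adjacency, then accumulate influence along paths) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The function names, the table layout (prompt columns first, then one column per generation), and the extension of the reviewer's single-step update to a sum over all prior generations are assumptions made here for illustration.

```python
def relu_row_normalize(table):
    """Apply max(x, 0) elementwise, then scale each row to sum to one."""
    normalized = []
    for row in table:
        clipped = [max(x, 0.0) for x in row]
        total = sum(clipped)
        # An all-zero row is left as zeros to avoid dividing by zero.
        normalized.append([x / total for x in clipped] if total > 0 else clipped)
    return normalized


def propagate_to_prompt(adjacency, num_prompt):
    """Accumulate each generation's total influence from the prompt tokens.

    Assumed layout: adjacency[t] has num_prompt + T columns. The first
    num_prompt columns are prompt tokens; column num_prompt + j is
    generation j. Only j < t may be non-zero, so the graph is a DAG.
    """
    totals = []
    for t, row in enumerate(adjacency):
        total = list(row[:num_prompt])  # direct prompt -> generation t edges
        for j in range(t):
            weight = row[num_prompt + j]  # edge generation j -> generation t
            for p in range(num_prompt):
                # Indirect influence: prompt -> ... -> generation j -> generation t.
                total[p] += weight * totals[j][p]
        totals.append(total)
    return totals


# Toy raw attribution table (hypothetical numbers): 2 prompt tokens, 2 generations.
raw = [
    [2.0, -1.0, 0.0, 0.0],  # generation 0 sees only the prompt
    [1.0, 1.0, 2.0, 0.0],   # generation 1 also depends on generation 0
]
adjacency = relu_row_normalize(raw)
totals = propagate_to_prompt(adjacency, num_prompt=2)
print(adjacency)  # [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.5, 0.0]]
print(totals)     # [[1.0, 0.0], [0.75, 0.25]]
```

In this toy table the negative entry in the first row disappears after clipping, and every row of shares sums to one, which is the loss of inhibitory effects and absolute magnitude that the review points out. Generation 1's direct prompt shares (0.25, 0.25) become (0.75, 0.25) once the path through generation 0 is counted.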
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
