Grad-ELLM: Gradient-based Explanations for Decoder-only LLMs
Xin Huang, Antoni B. Chan

TL;DR
Grad-ELLM introduces a gradient-based attribution method tailored for decoder-only transformer LLMs, improving faithfulness of input explanations without architectural changes.
Contribution
The paper proposes Grad-ELLM, a gradient-based attribution technique for decoder-only LLMs, along with two new faithfulness metrics for evaluating attribution methods.
Findings
Grad-ELLM outperforms existing attribution methods in faithfulness.
The method works across multiple tasks and models.
New metrics enable fairer comparison of attribution methods.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their black-box nature raises concerns about transparency and faithfulness. Input attribution methods aim to highlight each input token's contribution to the model's output, but existing approaches are typically model-agnostic and do not focus on transformer-specific architectures, leading to limited faithfulness. To address this, we propose Grad-ELLM, a gradient-based attribution method for decoder-only transformer-based LLMs. By aggregating channel importance from gradients of the output logit with respect to attention layers and spatial importance from attention maps, Grad-ELLM generates heatmaps at each generation step without requiring architectural modifications. Additionally, we introduce two faithfulness metrics, -Soft-NC and -Soft-NS, which are modifications of…
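The abstract describes combining a per-channel importance (from gradients of the output logit) with a per-token spatial importance (from attention maps). The sketch below shows one way such a combination could look for a single generation step, in the style of Grad-CAM. It is not the paper's formula: the function names, the mean-pooling of gradients, the ReLU, and the final normalization are all assumptions made for illustration, and the inputs are toy lists rather than real model tensors.

```python
# Illustrative sketch only: a Grad-CAM-style combination of gradient-derived
# channel weights and attention-derived token weights. The exact aggregation
# used by Grad-ELLM is not given in the abstract; everything here is assumed.

def channel_importance(grads):
    """Average each channel's gradient over tokens (assumed pooling rule).

    grads[t][c] is the gradient of the output logit with respect to
    channel c of the attention-layer output at input token t.
    """
    n_tokens = len(grads)
    n_channels = len(grads[0])
    return [
        sum(grads[t][c] for t in range(n_tokens)) / n_tokens
        for c in range(n_channels)
    ]


def token_heatmap(feats, grads, attn):
    """Return one score per input token for a single generation step.

    feats[t][c]: attention-layer output for token t, channel c.
    attn[t]: attention weight from the current output position to token t
             (assumed to be already averaged over heads).
    """
    weights = channel_importance(grads)
    scores = []
    for t, row in enumerate(feats):
        # Channel-weighted activation, clipped at zero (ReLU, as in Grad-CAM).
        activation = sum(w * f for w, f in zip(weights, row))
        scores.append(attn[t] * max(activation, 0.0))
    peak = max(scores)
    # Scale so the most important token has score 1; leave all-zero maps alone.
    return [s / peak for s in scores] if peak > 0 else scores


# Toy example: 3 input tokens, 2 channels.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
grads = [[0.5, -0.5], [0.5, -0.5], [0.5, -0.5]]
attn = [0.2, 0.3, 0.5]
heatmap = token_heatmap(feats, grads, attn)
```

In a real setting the gradients and attention maps would come from a forward and backward pass through the model at each generated token, which yields one heatmap per generation step.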
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Computational and Text Analysis Methods
