Applied Explainability for Large Language Models: A Comparative Study
Venkata Abhinandan Kancharla

TL;DR
This study compares three explainability techniques for LLMs (Integrated Gradients, Attention Rollout, and SHAP). It examines their strengths, limitations, and trade-offs on a practical NLP task: sentiment classification with a fine-tuned DistilBERT model.
Contribution
The paper systematically evaluates existing explainability methods under a consistent, reproducible setup. It offers practical insights into how these methods behave and what they trade off in real-world NLP tasks.
Findings
Gradient-based attribution (Integrated Gradients) produces more stable and intuitive explanations.
Attention-based methods (Attention Rollout) are faster but less aligned with prediction-relevant features.
Model-agnostic approaches (SHAP) are flexible but computationally costly.
Abstract
Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency makes it harder to trust, debug, and deploy them in real-world systems. This paper presents an applied comparative study of three explainability techniques (Integrated Gradients, Attention Rollout, and SHAP) on a DistilBERT model fine-tuned for SST-2 sentiment classification. Rather than proposing new methods, the study evaluates how existing approaches behave in practice under a consistent and reproducible setup. The results show that gradient-based attribution provides more stable and intuitive explanations. Attention-based methods are computationally efficient but less aligned with prediction-relevant features. Model-agnostic approaches offer flexibility but introduce…
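The abstract names the three techniques but does not show how each one scores inputs. The sketch below is a minimal illustration under stated assumptions, not the study's implementation. It uses a hypothetical four-feature logistic scorer (`W`, `B`, `model`) as a stand-in for DistilBERT and a zero baseline.

Integrated Gradients averages gradients along a straight path from the baseline to the input. The brute-force Shapley computation enumerates every coalition of features, with absent features set to the baseline; this is the kind of quantity SHAP estimators approximate. Attention Rollout adds the identity to each layer's head-averaged attention map and multiplies the maps across layers. The attention matrices here are random placeholders, not real model attention.

```python
import itertools
import math

import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical toy "model": a logistic scorer over per-token features.
# It stands in for DistilBERT only so the attribution math is easy to check.
W = np.array([1.5, -2.0, 0.5, 0.8])
B = -0.2


def model(x):
    """Positive-class probability for a feature vector x."""
    return sigmoid(x @ W + B)


def model_grad(x):
    """Analytic gradient of model(x) with respect to x."""
    p = model(x)
    return p * (1.0 - p) * W


def integrated_gradients(x, baseline, steps=200):
    """Integrated Gradients via a midpoint Riemann sum along the straight path."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([model_grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)


def exact_shapley(x, baseline):
    """Exact Shapley values where 'absent' features take their baseline value.

    Enumerates all 2^(n-1) coalitions per feature, so cost grows exponentially
    with the number of features (tokens).
    """
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                z = baseline.copy()
                for j in subset:
                    z[j] = x[j]
                without_i = model(z)
                z[i] = x[i]
                phi[i] += weight * (model(z) - without_i)
    return phi


def attention_rollout(attentions):
    """Attention Rollout (Abnar & Zuidema, 2020) over head-averaged maps.

    attentions: list of (seq, seq) row-stochastic matrices, first layer first.
    Mixes in the identity to account for residual connections, renormalizes
    rows, and multiplies layers so that later layers sit on the left.
    """
    seq = attentions[0].shape[0]
    rollout = np.eye(seq)
    for a in attentions:
        a_res = 0.5 * a + 0.5 * np.eye(seq)
        a_res = a_res / a_res.sum(axis=-1, keepdims=True)
        rollout = a_res @ rollout
    return rollout


x = np.array([1.0, 0.5, -1.0, 2.0])
baseline = np.zeros_like(x)

ig = integrated_gradients(x, baseline)
shap_exact = exact_shapley(x, baseline)
gap = model(x) - model(baseline)
print(ig.sum(), shap_exact.sum(), gap)  # both sums should be close to gap

rng = np.random.default_rng(0)
layers = [rng.random((5, 5)) for _ in range(3)]
layers = [m / m.sum(axis=-1, keepdims=True) for m in layers]
print(attention_rollout(layers).sum(axis=-1))  # each row sums to 1
```

Both attribution vectors should sum to model(x) - model(baseline): this is the completeness property for Integrated Gradients and the efficiency property for Shapley values. The Shapley loop grows as 2^n in the number of features, which hints at why model-agnostic methods are costly on full sentences. Rollout, by contrast, needs only one matrix product per layer.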
