Explaining Classes through Word Attribution
Samuel Rönnqvist, Amanda Myntti, Aki-Juhani Kyröläinen, Sampo Pyysalo, Veronika Laippala, Filip Ginter

TL;DR
This paper introduces a method to explain how deep learning models perceive classes in text classification by aggregating individual prediction explanations, effectively identifying key class-specific keywords.
Contribution
The study presents an approach that applies Integrated Gradients to individual predictions and aggregates the resulting word attributions into class-level explanations for text classification tasks.
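As a rough illustration of the Integrated Gradients technique named above, the sketch below attributes a toy model's output to its input features by averaging gradients along a straight path from a baseline to the input. Everything here is an assumption made for illustration: the logistic-regression `model`, the weights, the all-zero baseline, and the midpoint-rule approximation are not taken from the paper, which applies the technique to a transformer classifier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy differentiable classifier (hypothetical): logistic regression."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_grad(x, w, b):
    """Analytic gradient of the toy model's output with respect to x."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [x0 + alpha * (xi - x0) for xi, x0 in zip(x, baseline)]
        grad = model_grad(point, w, b)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    # Scale the path-averaged gradient by the input's offset from the baseline.
    return [(xi - x0) * gi for xi, x0, gi in zip(x, baseline, avg_grad)]

# Made-up weights and input, used only to exercise the function.
w = [2.0, -1.0, 0.5]
b = 0.1
x = [1.0, 0.5, -2.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)
# Completeness check: attributions should sum to model(x) - model(baseline).
completeness_gap = abs(
    sum(attributions) - (model(x, w, b) - model(baseline, w, b))
)
```

The completeness check is a useful sanity test for any implementation: if the attributions do not sum to the difference between the model's output at the input and at the baseline, the number of integration steps is probably too low.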
Findings
The method successfully identifies meaningful keywords for most classes.
It works well on Web register classification with the XLM-R model.
Small classes may have less discriminative keyword explanations.
Abstract
In recent years, several methods have been proposed for explaining individual predictions of deep learning models, yet there has been little study of how to aggregate these predictions to explain how such models view classes as a whole in text classification tasks. In this work, we propose a method for explaining classes using deep learning models and the Integrated Gradients feature attribution technique by aggregating explanations of individual examples in text classification to general descriptions of the classes. We demonstrate the approach on Web register (genre) classification using the XLM-R model and the Corpus of Online Registers of English (CORE), finding that the method identifies plausible and discriminative keywords characterizing all but the smallest class.
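The abstract does not spell out the aggregation step, so the following is only a minimal sketch of one plausible way to turn per-document word attributions into class-level keywords. The function name `aggregate_class_keywords`, the choice of taking the maximum score per word within a document, the mean across documents, and the `min_doc_freq` filter are all assumptions for illustration, and the toy scores are invented.

```python
from collections import defaultdict

def aggregate_class_keywords(doc_attributions, min_doc_freq=2, top_k=5):
    """Rank words for one class from per-document (word, score) attributions.

    Hypothetical procedure: keep each word's highest score per document,
    average over the documents containing it, and drop rare words.
    """
    totals = defaultdict(float)
    doc_freq = defaultdict(int)
    for doc in doc_attributions:
        per_doc = {}
        for word, score in doc:
            key = word.lower()  # merge case variants of the same word
            per_doc[key] = max(score, per_doc.get(key, float("-inf")))
        for key, score in per_doc.items():
            totals[key] += score
            doc_freq[key] += 1
    ranked = [
        (word, totals[word] / doc_freq[word])
        for word in totals
        if doc_freq[word] >= min_doc_freq
    ]
    # Highest mean attribution first; ties broken alphabetically.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_k]

# Invented attributions for three documents predicted as the same class.
docs = [
    [("recipe", 0.9), ("the", 0.01), ("oven", 0.6)],
    [("Recipe", 0.7), ("the", 0.02), ("stir", 0.5)],
    [("oven", 0.4), ("the", 0.0), ("recipe", 0.8)],
]
keywords = aggregate_class_keywords(docs)
```

In this toy run, "stir" is dropped because it appears in only one document, which shows why a frequency threshold of this kind could matter for small classes: with few documents, few words clear the bar.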
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
