
TL;DR
ConRAT is a novel self-interpretable model that extracts human-aligned concepts from text and explains predictions through a linear combination of these concepts, improving interpretability and performance.
Contribution
The paper introduces ConRAT, a new model that generates concept-based explanations aligned with human rationalization using only overall labels.
Findings
ConRAT produces concepts that align with human rationalizations.
It outperforms state-of-the-art methods on sentiment classification.
ConRAT enhances interpretability without requiring aspect-specific labels.
Abstract
Automated predictions require explanations to be interpretable by humans. One type of explanation is a rationale, i.e., a selection of input features such as relevant text snippets from which the model computes the outcome. However, a single overall selection does not provide a complete explanation, e.g., weighing several aspects for decisions. To this end, we present a novel self-interpretable model called ConRAT. Inspired by how human explanations for high-level decisions are often based on key concepts, ConRAT extracts a set of text snippets as concepts and infers which ones are described in the document. Then, it explains the outcome with a linear aggregation of concepts. Two regularizers drive ConRAT to build interpretable concepts. In addition, we propose two techniques to boost the rationale and predictive performance further. Experiments on both single- and multi-aspect…
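The "linear aggregation of concepts" mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The `Concept` class, the `aggregate` function, the example snippets, the scores, and the weights are all hypothetical, chosen only to show how a prediction that is a weighted sum of per-concept scores can be broken down into one contribution per concept.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    # All fields are illustrative assumptions, not names from the paper.
    snippet: str    # text span selected as the concept's rationale
    present: bool   # whether the concept is judged to be described in the document
    score: float    # hypothetical concept-level score, here in [-1, 1]


def aggregate(concepts, weights, bias=0.0):
    """Return (prediction, contributions) for a linear aggregation of concepts."""
    if len(concepts) != len(weights):
        raise ValueError("need exactly one weight per concept")
    # A concept that is absent from the document contributes nothing.
    contributions = [
        w * c.score if c.present else 0.0
        for c, w in zip(concepts, weights)
    ]
    return bias + sum(contributions), contributions


# Invented example data: two concepts found in the document, one absent.
concepts = [
    Concept("pours a hazy golden color", True, 0.5),
    Concept("smells of stale malt", True, -1.0),
    Concept("", False, 0.0),
]
weights = [1.0, 0.25, 1.0]

prediction, contributions = aggregate(concepts, weights)
for c, contrib in zip(concepts, contributions):
    print(f"{contrib:+.2f}  {c.snippet or '(concept not present)'}")
print(f"prediction = {prediction:+.2f}")
```

Because the aggregation is linear, the list of contributions sums (with the bias) exactly to the prediction, which is what lets each concept's snippet serve as a readable part of the explanation.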
