Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers
Sungmin Han, Jeonghyun Lee, Sangkyun Lee

TL;DR
Contrast-CAT is an activation contrast-based attribution method that improves the interpretability of transformer-based text classifiers by filtering class-irrelevant features out of activations, yielding clearer and more faithful explanations.
Contribution
The paper proposes Contrast-CAT, an attribution method that enhances interpretability by contrasting an input's activations with reference activations, and reports that it outperforms existing attribution techniques for transformer models.
Findings
Contrast-CAT achieves a 1.30x improvement in AOPC (area over the perturbation curve) under the MoRF (most relevant first) setting.
It attains 2.25x higher LOdds (log-odds) than state-of-the-art methods.
Experimental results confirm its effectiveness across datasets and models.
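For readers unfamiliar with the metrics named above, the following is a minimal sketch of how AOPC and LOdds are commonly computed under MoRF: tokens are masked one at a time, most relevant first, and the change in the predicted-class probability is averaged. The function names, the `[MASK]` replacement token, and the toy classifier are all hypothetical, and sign and averaging conventions differ between papers, so this should not be read as the paper's exact evaluation protocol.

```python
import math
from typing import Callable, List, Sequence, Tuple


def morf_order(scores: Sequence[float]) -> List[int]:
    """Token indices sorted from most to least relevant (MoRF order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def aopc_and_lodds(
    tokens: Sequence[str],
    scores: Sequence[float],
    predict_proba: Callable[[List[str]], float],
    max_k: int,
    mask_token: str = "[MASK]",  # assumed replacement token
) -> Tuple[float, float]:
    """Average probability drop (AOPC) and average log-odds ratio (LOdds)
    after cumulatively masking the top-1, top-2, ..., top-k tokens."""
    order = morf_order(scores)
    p_full = predict_proba(list(tokens))
    perturbed = list(tokens)
    drops, log_odds = [], []
    for k in range(min(max_k, len(tokens))):
        perturbed[order[k]] = mask_token
        p_k = predict_proba(perturbed)
        drops.append(p_full - p_k)
        log_odds.append(math.log(p_k / p_full))
    n = len(drops)
    return sum(drops) / n, sum(log_odds) / n


def toy_predict_proba(tokens: List[str]) -> float:
    """Hypothetical stand-in for a classifier's predicted-class probability."""
    hits = sum(t == "great" for t in tokens)
    return 0.1 + 0.4 * min(hits, 2)


tokens = ["a", "great", "great", "film"]
good_scores = [0.0, 0.9, 0.8, 0.1]  # ranks the two decisive tokens first
bad_scores = [0.9, 0.0, 0.0, 0.8]   # ranks the two filler tokens first

aopc_good, lodds_good = aopc_and_lodds(tokens, good_scores, toy_predict_proba, max_k=2)
aopc_bad, lodds_bad = aopc_and_lodds(tokens, bad_scores, toy_predict_proba, max_k=2)
```

In this toy setup, the attribution that ranks the decisive tokens first produces a larger probability drop (higher AOPC) and a more negative log-odds value than the one that ranks filler tokens first, which is the behavior these faithfulness metrics are meant to reward.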
Abstract
Transformers have profoundly influenced AI research, but explaining their decisions remains challenging -- even for relatively simpler tasks such as classification -- which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that…
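The contrasting idea described in the abstract can be illustrated with a small sketch. It assumes that token activations are given as a tokens-by-features matrix, that each reference has the same shape, and that per-feature relevance weights (for example, gradients with respect to the target class) are available; the weighting and the averaging over references are illustrative assumptions, not the formulation from the paper, and every name below is hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # shape: (tokens, features)


def contrast(activations: Matrix, reference: Matrix) -> List[List[float]]:
    """Element-wise difference between input and reference activations."""
    return [
        [a - r for a, r in zip(a_row, r_row)]
        for a_row, r_row in zip(activations, reference)
    ]


def token_attributions(
    activations: Matrix,
    references: Sequence[Matrix],
    weights: Matrix,  # assumed per-feature relevance, e.g. gradients
) -> List[float]:
    """Per-token score: weighted sum of contrasted features,
    averaged over all references (an illustrative choice)."""
    if not references:
        raise ValueError("at least one reference is required")
    totals = [0.0] * len(activations)
    for ref in references:
        diff = contrast(activations, ref)
        for t, (w_row, d_row) in enumerate(zip(weights, diff)):
            totals[t] += sum(w * d for w, d in zip(w_row, d_row))
    return [total / len(references) for total in totals]


# Feature 0 varies with the input; feature 1 is a large offset shared by the
# input and the references, standing in for a class-irrelevant feature.
acts = [[2.0, 5.0], [0.5, 5.0]]
refs = [
    [[0.0, 5.0], [0.0, 5.0]],
    [[1.0, 5.0], [0.5, 5.0]],
]
weights = [[1.0, 1.0], [1.0, 1.0]]

contrasted = token_attributions(acts, refs, weights)
uncontrasted = [sum(w * a for w, a in zip(w_row, a_row))
                for w_row, a_row in zip(weights, acts)]
```

Here the uncontrasted scores are dominated by the shared offset in feature 1, while the contrasted scores cancel it and separate the two tokens more clearly, which is the intuition behind filtering class-irrelevant features by subtraction.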
