Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers

Sungmin Han; Jeonghyun Lee; Sangkyun Lee

arXiv:2507.21186·cs.CL·July 30, 2025

TL;DR

Contrast-CAT introduces a novel activation contrast-based attribution method that improves interpretability of transformer-based text classifiers by filtering out irrelevant features, leading to clearer and more faithful explanations.

Contribution

The paper proposes Contrast-CAT, an attribution method that improves interpretability by contrasting the activations of an input with reference activations, and reports that it outperforms existing attribution techniques on transformer-based text classifiers.

Findings

01  Contrast-CAT achieves a 1.30x improvement in AOPC (area over the perturbation curve) under the MoRF (most relevant first) setting.

02  It attains a 2.25x improvement in LOdds (log-odds) over state-of-the-art methods.

03  Experimental results confirm its effectiveness across the datasets and models tested.
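
To make the two faithfulness metrics above concrete, here is a minimal sketch of how AOPC and LOdds are commonly computed under MoRF perturbation: the tokens an attribution method ranks highest are masked first, and the change in the model's confidence is averaged over several masking fractions. Everything here is an illustrative assumption rather than the paper's evaluation code: the helper names (`morf_order`, `perturb`, `aopc_and_lodds`), the `[MASK]` replacement, the masking fractions, and the toy classifier `toy_predict` are all hypothetical, and the exact definitions used in the paper may differ.

```python
import math
from typing import Callable, List, Sequence, Tuple


def morf_order(scores: Sequence[float]) -> List[int]:
    """Token indices sorted from most to least relevant (MoRF order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def perturb(tokens: Sequence[str], scores: Sequence[float],
            fraction: float, mask: str = "[MASK]") -> List[str]:
    """Mask the top `fraction` of tokens according to the attribution scores."""
    k = math.ceil(fraction * len(tokens))
    drop = set(morf_order(scores)[:k])
    return [mask if i in drop else t for i, t in enumerate(tokens)]


def aopc_and_lodds(predict: Callable[[Sequence[str]], float],
                   tokens: Sequence[str], scores: Sequence[float],
                   fractions: Sequence[float]) -> Tuple[float, float]:
    """Average confidence drop (AOPC) and average log-ratio (LOdds).

    `predict` returns the probability of the originally predicted class
    and is assumed to be strictly positive.
    """
    p_full = predict(tokens)
    drops, log_ratios = [], []
    for f in fractions:
        p = predict(perturb(tokens, scores, f))
        drops.append(p_full - p)                # larger = more faithful
        log_ratios.append(math.log(p / p_full))  # more negative = more faithful
    n = len(fractions)
    return sum(drops) / n, sum(log_ratios) / n


# Hypothetical toy classifier: sigmoid over fixed per-word weights.
WEIGHTS = {"great": 2.0, "fun": 1.0, "boring": -2.0}


def toy_predict(tokens: Sequence[str]) -> float:
    z = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


tokens = ["a", "great", "fun", "movie"]
good_scores = [0.0, 2.0, 1.0, 0.0]  # ranks the sentiment words highest
bad_scores = [2.0, 0.0, 0.0, 1.0]   # ranks the filler words highest

aopc_good, lodds_good = aopc_and_lodds(toy_predict, tokens, good_scores, (0.25, 0.5))
aopc_bad, lodds_bad = aopc_and_lodds(toy_predict, tokens, bad_scores, (0.25, 0.5))
```

In this toy setup the attribution that ranks the sentiment-bearing words first yields a larger AOPC and a more negative LOdds than the one that ranks filler words first, which is the sense in which these metrics reward faithful attributions under the definitions assumed here.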

Abstract

Transformers have profoundly influenced AI research, but explaining their decisions remains challenging -- even for relatively simple tasks such as classification -- which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that…
