Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training

Kristoffer Wickstrøm; Teresa Dorszewski; Siyan Chen; Michael Kampffmeyer; Elisabeth Wetzer; Robert Jenssen

arXiv:2512.17891·cs.CV·December 22, 2025

Open Access·4 Reviews

TL;DR

This paper introduces Keypoint Counting Classifiers (KCCs), a method to convert pre-trained Vision Transformer models into self-explainable models without retraining, enhancing transparency and interpretability in vision tasks.

Contribution

The paper proposes a novel approach to make ViT-based models self-explainable without retraining, leveraging keypoint matching for interpretability.

Findings

01. KCCs improve human-machine communication.

02. KCCs provide visualizable explanations.

03. KCCs enhance model transparency and reliability.

Abstract

Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures, which makes them impractical. With the advance of general-purpose foundation models based on Vision Transformers (ViTs), this impracticality becomes even more problematic. Therefore, new methods are necessary to provide transparency and reliability to ViT-based foundation models. In this work, we present a new method for turning any well-trained ViT-based model into a SEM without retraining, which we call Keypoint Counting Classifiers (KCCs). Recent works have shown that ViTs can automatically identify matching keypoints between images with high precision, and we build on these results to create an easily interpretable decision process that is inherently visualizable in the input. We perform an extensive evaluation which shows that KCCs improve the…
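
The abstract describes the decision process only at a high level, so the following is a minimal sketch of what a keypoint counting rule could look like, not the authors' implementation. It assumes that patch features from a frozen ViT are already available as arrays, that a "keypoint match" is a mutual nearest-neighbour pair under cosine similarity, and that the predicted class is the one whose reference image yields the most matches. The function names and the `prototypes` dictionary are hypothetical and exist only for illustration.

```python
import numpy as np


def l2_normalize(x):
    # Row-wise L2 normalisation; the small epsilon guards against zero vectors.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)


def count_mutual_matches(query_tokens, proto_tokens):
    """Count mutual nearest-neighbour pairs between two sets of patch features.

    query_tokens: (n_query_patches, dim) array (assumed ViT patch features).
    proto_tokens: (n_proto_patches, dim) array for one reference image.
    """
    # Cosine similarity between every query patch and every reference patch.
    sim = l2_normalize(query_tokens) @ l2_normalize(proto_tokens).T
    q_to_p = sim.argmax(axis=1)  # best reference patch for each query patch
    p_to_q = sim.argmax(axis=0)  # best query patch for each reference patch
    # Query patch i counts as a match if its best reference patch points back to i.
    mutual = p_to_q[q_to_p] == np.arange(query_tokens.shape[0])
    return int(mutual.sum())


def predict_by_counting(query_tokens, prototypes):
    """Return (predicted_label, per-class match counts).

    prototypes: hypothetical dict mapping class label -> patch-feature array
    of one reference image for that class.
    """
    counts = {
        label: count_mutual_matches(query_tokens, tokens)
        for label, tokens in prototypes.items()
    }
    return max(counts, key=counts.get), counts
```

Under these assumptions the per-class counts are the explanation itself: each counted pair corresponds to one query patch and one reference patch, so the matches could be drawn directly on the two images. A practical variant would likely also discard low-similarity pairs, which this sketch does not do.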

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 3

Strengths

- The paper introduces a training-free method for generating post-hoc explanations for ViT models. The proposed keypoint-based visualization offers an alternative to common explanation modalities like heatmaps, making the work relevant to ongoing efforts in XAI.
- The authors conduct a user study to compare their proposed visualization against baseline methods, and the results suggest a user preference for their approach in terms of perceived quality. The quantitative analysis benchmarks the met

Weaknesses

1. **Contradiction in "Self-Explainability" and Lack of Faithfulness:** A major conceptual weakness arises from the reliance on external models (SAM, Grounding DINO). By forcing a foreground mask, the method is not explaining the original ViT's decision but rather a constrained decision within a pre-processed input. This compromises the faithfulness of the explanation, as it cannot reveal if the model relies on spurious background correlations for its prediction. Furthermore, depending on extern

Reviewer 02·Rating 2·Confidence 3

Strengths

- The goal of introducing a training-free self-explainability method for ViTs is novel and directly addresses the inflexibility of prior SEMs that require specific architectures or costly retraining.
- The user study using different metrics for comparing user experiences between different explainability methods was well-designed.
- The paper is clearly written and the method description is easy to follow.

Weaknesses

- The paper's goal is to create a self-explainable method where the decision process is "inherently explainable"; however, it relies on external, non-explicitly explainable segmentation models (SAM and Grounding DINO) at the very beginning of the self-explanation pipeline for foreground segmentation. The work would benefit from an argument as to why additional non-explicitly explainable models can be used as part of the self-explanation pipeline.
- As noted in Section 4, this method quickly becom

Reviewer 03·Rating 2·Confidence 5

Strengths

The idea is interesting and the approach is novel, with a small disclaimer about the prior use in this context of textual information [8] and of correspondence to object parts [2]. I would see it more as a generalization of those ideas, which is still a valid and novel contribution, though not a groundbreaking one. The introduction is clearly written; other parts are somewhat less so (see weaknesses). The paper tackles an important topic and is significant for SEMs.

Weaknesses

"despite the fact that many studies have pointed out limitations associated with both bounding boxes and heatmaps" - none of the studies showing this are referenced. It is not specified whether the heatmaps of SEMs are criticized or heatmaps in general. The statements are too broad and too vague. "To improve the usability of SEMs, new methods for visualizing explanations must be developed" - work on improving such models exists [1], [2], [3] and should be referenced. This sentence makes the reader

Reviewer 04·Rating 2·Confidence 4

Strengths

* **S1 (Tackling an Important Problem)**: The paper focuses on making ML models easier to understand by creating models that can explain their decisions without needing extra training.
* **S2 (Novel Idea)**: The method relies on existing methodologies; however, using keypoints to make a ViT model a SEM is a new approach.

Weaknesses

* **W1 (Clarity of the Methodological Description)**: The explanation of the proposed method could be enhanced to improve readability and reproducibility. The notation in the methodology section is at times inconsistent or undefined. For instance, in Equation 1, S(j) and Z are used without a clear definition. A more detailed and systematic explanation of the mathematical formulations would greatly benefit the paper.
* **W2 (Details of the Implementation)**: In lines 164-166, the paper states th

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis