Case-Enhanced Vision Transformer: Improving Explanations of Image Similarity with a ViT-based Similarity Metric

Ziwei Zhao; David Leake; Xiaomeng Ye; David Crandall

arXiv:2407.16981 · cs.CV · July 25, 2024

TL;DR

This paper introduces CEViT, a Vision Transformer-based similarity metric that aims to make image similarity assessments more explainable while keeping classification accuracy competitive when used with k-NN.

Contribution

The paper proposes CEViT, a novel case-enhanced ViT model that improves interpretability of image similarity and integrates with k-NN classification.

Findings

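01. Used as the similarity metric in k-NN classification, CEViT reaches accuracy comparable to state-of-the-art computer vision models in initial experiments.

02. CEViT explanations of image similarity can be influenced by prior cases, highlighting the aspects of similarity relevant to those cases.

03. The results are preliminary, but they suggest CEViT can also illustrate differences between classes.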

Abstract

This short paper presents preliminary research on the Case-Enhanced Vision Transformer (CEViT), a similarity measurement method aimed at improving the explainability of similarity assessments for image data. Initial experimental results suggest that integrating CEViT into k-Nearest Neighbor (k-NN) classification yields classification accuracy comparable to state-of-the-art computer vision models, while adding capabilities for illustrating differences between classes. CEViT explanations can be influenced by prior cases, to illustrate aspects of similarity relevant to those cases.
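The k-NN setup mentioned in the abstract can be sketched as follows. This is only a minimal illustration of k-NN classification with a pluggable pairwise similarity function, not the authors' implementation: `cosine_similarity` is a stand-in for the learned CEViT metric, and the names `knn_classify`, `cases`, and `labels` are hypothetical. The function returns the indices of the retrieved neighbors along with the label, since those prior cases are what a case-based explanation would point to.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

import numpy as np


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in similarity (assumption): CEViT would replace this function."""
    a_vec = np.asarray(a, dtype=float)
    b_vec = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a_vec) * np.linalg.norm(b_vec)
    return float(a_vec @ b_vec / denom) if denom > 0 else 0.0


def knn_classify(
    query: Sequence[float],
    cases: Sequence[Sequence[float]],
    labels: Sequence[str],
    similarity: Callable[[Sequence[float], Sequence[float]], float] = cosine_similarity,
    k: int = 3,
) -> Tuple[str, List[int]]:
    """Return the majority label of the k most similar cases and their indices."""
    if len(cases) != len(labels):
        raise ValueError("cases and labels must have the same length")
    if not 1 <= k <= len(cases):
        raise ValueError("k must be between 1 and the number of cases")

    # Score the query against every stored case with the chosen metric.
    scores = [similarity(query, case) for case in cases]

    # Indices of the k highest-scoring cases, most similar first.
    nearest = sorted(range(len(cases)), key=lambda i: scores[i], reverse=True)[:k]

    # Majority vote; ties go to the label of the most similar tied neighbor.
    votes = Counter(labels[i] for i in nearest)
    top = max(votes.values())
    for i in nearest:
        if votes[labels[i]] == top:
            return labels[i], nearest
    raise RuntimeError("unreachable: nearest is never empty")


if __name__ == "__main__":
    cases = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    labels = ["cat", "cat", "dog", "dog"]
    label, neighbors = knn_classify([1.0, 0.05], cases, labels, k=3)
    print(label, neighbors)
```

Passing the metric as a parameter keeps the classifier independent of how similarity is computed, so a learned pairwise model could be swapped in without changing the voting logic.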

Code & Models

Repositories

ziweizhao1993/cevit (PyTorch, official)

Taxonomy

Methods: Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Dense Connections