TL;DR
This paper introduces CEViT, a Vision Transformer-based similarity metric that makes image similarity assessments more explainable while maintaining competitive classification accuracy.
Contribution
The paper proposes CEViT, a novel case-enhanced ViT model that improves the interpretability of image similarity assessments and integrates with k-NN classification.
Findings
Integrated into k-NN classification, CEViT achieves accuracy comparable to state-of-the-art computer vision models.
CEViT provides case-influenced explanations of image similarity.
Preliminary results suggest CEViT can illustrate differences between classes, improving explainability.
Abstract
This short paper presents preliminary research on the Case-Enhanced Vision Transformer (CEViT), a similarity measurement method aimed at improving the explainability of similarity assessments for image data. Initial experimental results suggest that integrating CEViT into k-Nearest Neighbor (k-NN) classification yields classification accuracy comparable to state-of-the-art computer vision models, while adding capabilities for illustrating differences between classes. CEViT explanations can be influenced by prior cases to illustrate aspects of similarity relevant to those cases.
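The abstract describes CEViT as a similarity measure plugged into k-NN classification. A minimal sketch of that integration follows, assuming only a generic pairwise similarity function. `toy_similarity` (negative Euclidean distance over precomputed feature vectors) is a hypothetical stand-in, not CEViT itself, and `knn_classify` is an illustrative name, not the authors' code. Returning the retrieved neighbours alongside the label keeps the supporting cases available for inspection, which is the hook a case-based explanation would build on.

```python
import math
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

# A pairwise similarity: higher means more similar.
Similarity = Callable[[Sequence[float], Sequence[float]], float]


def toy_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned pairwise similarity such as CEViT:
    negative Euclidean distance between two feature vectors."""
    return -math.dist(a, b)


def knn_classify(
    query: Sequence[float],
    cases: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    k: int = 3,
    similarity: Similarity = toy_similarity,
) -> Tuple[Hashable, List[Tuple[float, Hashable]]]:
    """Label `query` by majority vote among its k most similar stored cases.

    Returns the predicted label and the (score, label) pairs of the
    neighbours that produced it, most similar first.
    """
    if not 1 <= k <= len(cases):
        raise ValueError("k must be between 1 and the number of cases")
    # Score every stored case against the query with the pairwise similarity.
    scored = sorted(
        ((similarity(query, case), label) for case, label in zip(cases, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = scored[:k]
    votes = Counter(label for _, label in neighbours)
    top = max(votes.values())
    # Tie-break: among labels with the most votes, pick the one whose
    # nearest neighbour ranks highest.
    for _, label in neighbours:
        if votes[label] == top:
            return label, neighbours
    raise AssertionError("unreachable: some neighbour holds the top vote")
```

Swapping `toy_similarity` for a model-based pairwise score changes nothing else in the classifier, which is why a learned similarity can slot into k-NN without altering the voting logic.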
Taxonomy
Methods: Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Dense Connections
