Vision Transformers with Natural Language Semantics
Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro

TL;DR
This paper introduces Semantic Vision Transformers (sViT), a novel model that incorporates semantic information into tokens, improving interpretability, robustness, and efficiency over traditional ViT models by leveraging segmentation-based tokenization.
Contribution
sViT is the first transformer model to integrate semantic segmentation for tokenization, enhancing interpretability, data efficiency, and robustness in vision tasks.
Findings
sViT outperforms ViT with less training data
sViT shows superior out-of-distribution generalization
Semantic tokens improve model interpretability
Abstract
Tokens or patches within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP). Typically, ViT tokens are associated with rectangular image patches that lack specific semantic context, making interpretation difficult and failing to effectively encapsulate information. We introduce a novel transformer model, Semantic Vision Transformers (sViT), which leverages recent progress on segmentation models to design novel tokenizer strategies. sViT effectively harnesses semantic information, creating an inductive bias reminiscent of convolutional neural networks while capturing global dependencies and contextual information within images that are characteristic of transformers. Through validation using real datasets, sViT demonstrates superiority over ViT, requiring less training data while maintaining similar or superior…
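To make the contrast between rectangular patches and segment-based tokens concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' tokenizer: the abstract does not specify how segments become tokens, so the function names (`grid_tokens`, `segment_tokens`), the integer label map standing in for a segmentation model's output, and the crop, mask, and resize steps are all hypothetical choices.

```python
import numpy as np


def grid_tokens(image, patch):
    """ViT-style tokenization: split an (H, W, C) image into square patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    x = image[: rows * patch, : cols * patch]
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    x = x.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(rows * cols, patch * patch * c)


def segment_tokens(image, labels, size):
    """Hypothetical segment-based tokenization (an assumption, not the paper's method).

    `labels` is an (H, W) integer map, assumed to come from some segmentation
    model. Each segment yields one token: crop its bounding box, zero out
    pixels outside the segment, and resize to (size, size) by nearest neighbour.
    """
    tokens, boxes = [], []
    for seg_id in np.unique(labels):
        mask = labels == seg_id
        ys, xs = np.nonzero(mask)
        y0, y1 = ys.min(), ys.max() + 1
        x0, x1 = xs.min(), xs.max() + 1
        # Keep only the pixels that belong to this segment.
        crop = image[y0:y1, x0:x1] * mask[y0:y1, x0:x1, None]
        # Nearest-neighbour resize via integer index sampling.
        yi = np.arange(size) * (y1 - y0) // size
        xi = np.arange(size) * (x1 - x0) // size
        resized = crop[yi][:, xi]
        tokens.append(resized.reshape(-1))
        boxes.append((y0, x0, y1, x1))  # could feed a positional embedding
    return np.stack(tokens), np.array(boxes)


# Toy example: an 8x8 RGB image whose left and right halves are two segments.
image = np.random.default_rng(0).random((8, 8, 3))
labels = np.zeros((8, 8), dtype=int)
labels[:, 4:] = 1

patch_tok = grid_tokens(image, patch=4)
seg_tok, seg_boxes = segment_tokens(image, labels, size=4)
```

In this sketch the number of grid tokens depends only on image and patch size, while the number of segment tokens follows the image content (one per segment). Each segment token also carries a bounding box that a model could use for positional information.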
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Infrared Target Detection Methodologies · Visual Attention and Saliency Detection
