Vision Transformers with Natural Language Semantics

Young Kyung Kim; J. Matías Di Martino; Guillermo Sapiro

arXiv:2402.17863 · cs.CV · February 29, 2024 · 1 citation

TL;DR

This paper introduces Semantic Vision Transformers (sViT), a novel model that incorporates semantic information into tokens, improving interpretability, robustness, and efficiency over traditional ViT models by leveraging segmentation-based tokenization.

Contribution

sViT is the first transformer model to integrate semantic segmentation for tokenization, enhancing interpretability, data efficiency, and robustness in vision tasks.

Findings

01. sViT matches or exceeds ViT performance while requiring less training data.

02. sViT generalizes better than ViT to out-of-distribution data.

03. Semantic tokens make the model easier to interpret.

Abstract

Tokens or patches within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP). Typically, ViT tokens are associated with rectangular image patches that lack specific semantic context, making interpretation difficult and failing to effectively encapsulate information. We introduce a novel transformer model, Semantic Vision Transformers (sViT), which leverages recent progress on segmentation models to design novel tokenizer strategies. sViT effectively harnesses semantic information, creating an inductive bias reminiscent of convolutional neural networks while capturing global dependencies and contextual information within images that are characteristic of transformers. Through validation using real datasets, sViT demonstrates superiority over ViT, requiring less training data while maintaining similar or superior…
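To make the contrast in the abstract concrete, here is a minimal sketch comparing rectangular patch tokens with tokens built from a segmentation map. It is an illustration under stated assumptions, not the authors' tokenizer: the function names `patch_tokens` and `segment_tokens`, the hand-made label map, the bounding-box crop, and the nearest-neighbour resize are all hypothetical choices made for this example. In practice the label map would come from a segmentation model.

```python
import numpy as np

def patch_tokens(image, patch=4):
    """ViT-style tokenization: non-overlapping square patches, each flattened."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def segment_tokens(image, labels, size=4):
    """Hypothetical segment-based tokenization: one token per segment id.

    Each segment is cropped to its bounding box, pixels outside the segment
    are zeroed, and the crop is resized to size x size (nearest neighbour)
    so that every token has the same length.
    """
    tokens = []
    for seg_id in np.unique(labels):
        mask = labels == seg_id
        rows = np.flatnonzero(mask.any(axis=1))
        cols = np.flatnonzero(mask.any(axis=0))
        r0, r1 = rows[0], rows[-1] + 1
        c0, c1 = cols[0], cols[-1] + 1
        crop = image[r0:r1, c0:c1] * mask[r0:r1, c0:c1, None]
        # Nearest-neighbour index maps from the fixed grid back into the crop.
        ri = np.arange(size) * (r1 - r0) // size
        ci = np.arange(size) * (c1 - c0) // size
        tokens.append(crop[ri][:, ci].reshape(-1))
    return np.stack(tokens)

# Toy input: an 8x8 RGB image and a made-up label map with three segments.
image = np.random.default_rng(0).random((8, 8, 3))
labels = np.zeros((8, 8), dtype=int)
labels[:, 4:] = 1        # right half of the image
labels[2:5, 1:3] = 2     # small region inside the left half

vit_tokens = patch_tokens(image, patch=4)
sem_tokens = segment_tokens(image, labels, size=4)
print(vit_tokens.shape, sem_tokens.shape)
```

In this sketch the number of patch tokens is fixed by the image and patch size, while the number of segment tokens follows the number of regions in the label map, so each token corresponds to one region rather than an arbitrary rectangle.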

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Infrared Target Detection Methodologies · Visual Attention and Saliency Detection