Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
Neha Kalibhat, Priyatham Kattakinda, Sumit Nawathe, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi

TL;DR
This paper investigates the use of semantically meaningful tokens derived from segmentation and scene graphs to improve vision transformer representations, leading to significant gains in retrieval and compositionality tasks.
Contribution
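It introduces a tokenization approach built on tangible tokens (representations of instance segmentation masks) and intangible tokens (relationships and actions from scene graphs), together with a new attention mechanism intended to capture structural and semantic relationships among these visual tokens.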
Findings
47% improvement in text-to-image retrieval
44% improvement in image-to-text retrieval
Notable gains on compositionality benchmarks
Abstract
Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce…
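The two steps the abstract describes, turning segmentation masks into visual tokens and aligning image embeddings with caption embeddings, can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, masked average pooling is only one plausible way to build a tangible token, and the symmetric contrastive loss with a temperature of 0.07 is a common vision-language alignment objective that the abstract does not confirm.

```python
import math


def mask_pooled_token(features, mask):
    """Average the per-pixel feature vectors that fall inside a binary mask.

    features: H x W x D nested lists; mask: H x W nested lists of 0/1.
    Assumption: masked average pooling stands in for the paper's token extractor.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    area = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                area += 1
                for d in range(dim):
                    total[d] += vec[d]
    # An empty mask yields an all-zero token instead of dividing by zero.
    return [t / area for t in total] if area else total


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax_at(row, index):
    """Log-softmax of row[index], computed stably by subtracting the max."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_z


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Contrastive loss averaged over image-to-text and text-to-image directions.

    Assumption: pair k of image_embs and text_embs is the matching pair.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    i2t = -sum(log_softmax_at(logits[k], k) for k in range(n)) / n
    t2i = -sum(log_softmax_at(columns[k], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

In this sketch, each instance mask produces one token regardless of the object's size or shape, which is the contrast with uniform patches that the abstract draws. The loss is lowest when each image embedding is closest to its own caption embedding.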
Taxonomy
Topics: Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning · Advanced Text Analysis Techniques
Methods: Tanh Activation
