Language Models as Zero-shot Visual Semantic Learners
Yue Jiao, Jonathon Hare, Adam Prügel-Bennett

TL;DR
This paper demonstrates that transformer-based language models can be effectively used as zero-shot visual semantic learners, outperforming static embeddings in complex scene understanding and novel category association.
Contribution
The paper introduces the Visual Semantic Embedding Probe (VSEP), which leverages contextualized language model knowledge for visual semantic tasks, and highlights the advantages of contextualized embeddings over static ones.
Findings
Contextualized embeddings outperform static ones in short object chains.
VSEP effectively distinguishes object representations in complex scenes.
Current VSE models lack mutual exclusivity bias, limiting performance.
Abstract
Visual Semantic Embedding (VSE) models, which map images into a rich semantic embedding space, have been a milestone in object recognition and zero-shot learning. Current approaches to VSE heavily rely on static word embedding techniques. In this work, we propose a Visual Semantic Embedding Probe (VSEP) designed to probe the semantic information of contextualized word embeddings in visual semantic understanding tasks. We show that the knowledge encoded in transformer language models can be exploited for tasks requiring visual semantic understanding. The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner. We further introduce a zero-shot setting with VSEPs to evaluate a model's ability to associate a novel word with a novel visual category. We find that contextual representations in language…
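To make the general idea of a visual semantic embedding concrete, the following is a minimal sketch of zero-shot classification in a shared embedding space. It is not the authors' VSEP: it assumes a plain linear (ridge regression) map from image features to word embeddings, and it uses random vectors as stand-ins for real image features and language model embeddings. All names (`fit_linear_projection`, `zero_shot_predict`, `run_demo`) and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_linear_projection(image_feats, word_embs, ridge=1e-3):
    """Closed-form ridge regression from image features to word embeddings."""
    dim = image_feats.shape[1]
    gram = image_feats.T @ image_feats + ridge * np.eye(dim)
    return np.linalg.solve(gram, image_feats.T @ word_embs)


def zero_shot_predict(image_feats, projection, candidate_embs):
    """For each image, return the index of the most cosine-similar candidate."""
    projected = image_feats @ projection
    projected = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    candidates = candidate_embs / np.linalg.norm(
        candidate_embs, axis=1, keepdims=True
    )
    return (projected @ candidates.T).argmax(axis=1)


def run_demo(seed=0):
    """Train on seen classes, then classify images of unseen classes."""
    rng = np.random.default_rng(seed)
    n_classes, n_seen, emb_dim, img_dim, per_class = 8, 6, 4, 16, 20

    # Assumption: random vectors stand in for word embeddings, and synthetic
    # "image features" are a noisy linear function of the class embedding.
    class_embs = rng.normal(size=(n_classes, emb_dim))
    mixing = rng.normal(size=(emb_dim, img_dim))

    def sample_images(labels):
        clean = class_embs[labels] @ mixing
        return clean + 0.01 * rng.normal(size=clean.shape)

    seen_labels = np.repeat(np.arange(n_seen), per_class)
    unseen_labels = np.repeat(np.arange(n_seen, n_classes), per_class)

    # Fit the projection on seen classes only.
    projection = fit_linear_projection(
        sample_images(seen_labels), class_embs[seen_labels]
    )

    # Classify unseen-class images against the embeddings of all classes.
    predictions = zero_shot_predict(
        sample_images(unseen_labels), projection, class_embs
    )
    return float((predictions == unseen_labels).mean())


if __name__ == "__main__":
    print("zero-shot accuracy on unseen classes:", run_demo())
```

In this toy setup the unseen classes are recognized only because their embeddings lie in the same space the projection was trained to target. In a real pipeline the random vectors would be replaced by features from a vision encoder and by static or contextualized word embeddings, which is the comparison the abstract describes.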
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
