What Remains of Visual Semantic Embeddings
Yue Jiao, Jonathon Hare, Adam Prügel-Bennett

TL;DR
This paper evaluates how well current visual semantic embedding models encode semantic information in zero-shot learning, introducing a fair benchmark and revealing their limitations in capturing semantic relationships.
Contribution
It introduces a new ZSL benchmark using split tiered-ImageNet and a unified contrastive learning framework to fairly evaluate semantic encoding capabilities.
Findings
Current ZSL models struggle with semantic relationships.
The new benchmark avoids structural flaws of standard ImageNet.
The results encourage exploration of contextual language representations in ZSL.
Abstract
Zero-shot learning (ZSL) has seen a surge in interest over the past decade because of its close links to the mechanism by which young children recognize novel objects. Although different paradigms of visual semantic embedding models are designed to align visual features and distributed word representations, it is unclear to what extent current ZSL models encode semantic information from distributed word representations. In this work, we introduce the split of tiered-ImageNet to the ZSL task, in order to avoid the structural flaws in the standard ImageNet benchmark. We build a unified framework for ZSL with contrastive learning as pre-training, which guarantees no semantic information leakage and encourages linearly separable visual features. Our work makes it possible to evaluate visual semantic embedding models fairly in a ZSL setting in which semantic inference is decisive. With this framework, we show…
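As a rough illustration of what "aligning visual features and distributed word representations" means at inference time, the sketch below projects a visual feature into a word-embedding space and picks the unseen class whose word vector is most similar. It is a generic toy example, not the authors' framework: the function names (`cosine`, `project`, `zsl_predict`), the linear projection `W`, the class names, and all numbers are assumptions made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(W, x):
    """Linear map from visual feature space to word-embedding space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def zsl_predict(W, visual_feature, class_word_vectors):
    """Return the class whose word vector is closest to the projected feature."""
    z = project(W, visual_feature)
    return max(class_word_vectors, key=lambda c: cosine(z, class_word_vectors[c]))

# Hypothetical 3-d visual features mapped into a 2-d word space.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
# Hypothetical word vectors for classes never seen during training.
unseen = {"zebra": [1.0, 0.1], "whale": [0.1, 1.0]}

pred = zsl_predict(W, [0.9, 0.2, 0.5], unseen)
print(pred)
```

In a real system the projection would be learned on seen classes only and the visual features would come from a pre-trained encoder; here both are fixed by hand so the nearest-class step is easy to follow.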
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
Methods: Contrastive Learning
