Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Yale Song; Mohammad Soleymani

arXiv:1906.04402·cs.CV·July 18, 2019·22 cites

TL;DR

This paper introduces PIE-Nets, a visual-semantic embedding approach that generates multiple representations for polysemous instances, improving cross-modal retrieval for images and videos. It also introduces a new dataset for video-text retrieval.

Contribution

The paper proposes a multi-head self-attention-based embedding method for polysemous instances and introduces a new large-scale video-text dataset for retrieval tasks.

Findings

01. PIE-Nets outperform existing methods on image-text retrieval.
02. Effective handling of polysemous instances improves retrieval accuracy.
03. New dataset enables research in video-text retrieval.

Abstract

Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie up two PIE-Nets and optimize them jointly…
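The idea of combining a global feature with attention-pooled local features through a residual connection can be illustrated with a small sketch. This is a simplified illustration and not the authors' implementation (see the linked repository for that). The function name `pie_embed`, the weight matrices `W1` and `W2`, the tanh scoring function, the mean-pooled global feature, and the toy dimensions are all assumptions made for the example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def pie_embed(global_feat, local_feats, W1, W2):
    """Return K embeddings of one instance (hypothetical helper).

    global_feat: list of D floats (global context)
    local_feats: N lists of D floats (e.g. image regions or word vectors)
    W1: H x D matrix, W2: K x H matrix; each row of W2 is one attention head.
    """
    # Score every local feature under every head: scores[n][k].
    hidden = [[math.tanh(h) for h in matvec(W1, x)] for x in local_feats]
    scores = [matvec(W2, h) for h in hidden]

    embeddings = []
    for k in range(len(W2)):
        # Each head has its own attention distribution over local features.
        attn = softmax([s[k] for s in scores])
        pooled = [
            sum(a * x[d] for a, x in zip(attn, local_feats))
            for d in range(len(global_feat))
        ]
        # Residual combination: global context plus the head-specific local summary.
        combined = [g + p for g, p in zip(global_feat, pooled)]
        embeddings.append(l2_normalize(combined))
    return embeddings


# Toy example with random weights and features (illustrative values only).
random.seed(0)
D, H, K, N = 4, 3, 2, 5
W1 = [[random.gauss(0, 1) for _ in range(D)] for _ in range(H)]
W2 = [[random.gauss(0, 1) for _ in range(H)] for _ in range(K)]
local_feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
# Mean pooling stands in for the global feature here; this is an assumption.
global_feat = [sum(x[d] for x in local_feats) / N for d in range(D)]

embeddings = pie_embed(global_feat, local_feats, W1, W2)
for k, emb in enumerate(embeddings):
    print(k, [round(v, 3) for v in emb])
```

Because each head attends to the local features differently, one instance yields K distinct points in the embedding space rather than a single averaged point, which is the property the abstract contrasts with injective embedding.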

Code & Models

Repositories

yalesong/pvse
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning